0% found this document useful (0 votes)

43 views447 pages

Data Analysis and Optimization

Uploaded by

tsoiho.hku

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views447 pages

Data Analysis and Optimization

Uploaded by

tsoiho.hku

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Springer Optimization and Its Applications 202

Boris Goldengorin
Sergei Kuznetsov Editors

Data Analysis and

Optimization
In Honor of Boris Mirkin’s 80th Birthday
Springer Optimization and Its Applications

Volume 202

Series Editors
Panos M. Pardalos , University of Florida, Gainesville, FL, USA
My T. Thai , CSE Building, University of Florida, Gainesville, FL, USA

Advisory Editors
Roman V. Belavkin, Faculty of Science and Technology, Middlesex University,
London, UK
John R. Birge, University of Chicago, Chicago, IL, USA
Sergiy Butenko, Texas A&M University, College Station, TX, USA
Vipin Kumar, Dept Comp Sci & Engg, University of Minnesota, Minneapolis, MN,
USA
Anna Nagurney, Isenberg School of Management, University of Massachusetts
Amherst, Amherst, MA, USA
Jun Pei, School of Management, Hefei University of Technology, Hefei, Anhui,
China
Oleg Prokopyev, Department of Industrial Engineering, University of Pittsburgh,
Pittsburgh, PA, USA
Steffen Rebennack, Karlsruhe Institute of Technology, Karlsruhe, Baden-
Württemberg, Germany
Mauricio Resende, Amazon (United States), Seattle, WA, USA
Tamás Terlaky, Lehigh University, Bethlehem, PA, USA
Van Vu, Department of Mathematics, Yale University, New Haven, CT, USA
Michael N. Vrahatis, Mathematics Department, University of Patras, Patras, Greece
Guoliang Xue, Ira A. Fulton School of Engineering, Arizona State University,
Tempe, AZ, USA
Yinyu Ye, Stanford University, Stanford, CA, USA

Honorary Editor
Ding-Zhu Du, University of Texas at Dallas, Richardson, TX, USA
Aims and Scope
Optimization has continued to expand in all directions at an astonishing rate. New
algorithmic and theoretical techniques are continually developing and the diffusion
into other disciplines is proceeding at a rapid pace, with a spot light on machine
learning, artificial intelligence, and quantum computing. Our knowledge of all
aspects of the field has grown even more profound. At the same time, one of the
most striking trends in optimization is the constantly increasing emphasis on the
interdisciplinary nature of the field. Optimization has been a basic tool in areas
not limited to applied mathematics, engineering, medicine, economics, computer
science, operations research, and other sciences.
The series Springer Optimization and Its Applications (SOIA) aims to publish
state-of-the-art expository works (monographs, contributed volumes, textbooks,
handbooks) that focus on theory, methods, and applications of optimization. Topics
covered include, but are not limited to, nonlinear optimization, combinatorial opti-
mization, continuous optimization, stochastic optimization, Bayesian optimization,
optimal control, discrete optimization, multi-objective optimization, and more. New
to the series portfolio include Works at the intersection of optimization and machine
learning, artificial intelligence, and quantum computing.
Volumes from this series are indexed by Web of Science, zbMATH, Mathematical
Reviews, and SCOPUS.
Boris Goldengorin • Sergei Kuznetsov
Editors

Data Analysis and

Optimization
In Honor of Boris Mirkin’s 80th Birthday
Editors
Boris Goldengorin Sergei Kuznetsov
Department of Mathematics Faculty of Computer Science
New Uzbekistan University National Research University Higher
Tashkent, Uzbekistan School of Economics
Moscow, Russia
Center for Applied Optimization
University of Florida
Gainesville, FL, USA

ISSN 1931-6828 ISSN 1931-6836 (electronic)

Springer Optimization and Its Applications
ISBN 978-3-031-31653-1 ISBN 978-3-031-31654-8 (eBook)
https://doi.org/10.1007/978-3-031-31654-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.

Preface

This book presents the state of the art in the emerging field of data science which
includes models for layered security, protection of large gathering sites, cancer
diagnostics, self-driving cars and other applications with catastrophic consequences
of wrong decisions. The manipulability of aggregation procedures for the case of
large numbers of voters is analyzed from a theoretical point of view and justified
by computational experiments involving at least an order of magnitude larger
number of voters. Many tree-type structures are considered: from phylogenetic
trees representing the main patterns of vertical descent through consensus trees
and supertrees widely used in evolutionary studies to combine phylogenetic
information contained in individual gene trees. The statistical part of this book
studies an impact of data mining and modeling on predictability assessment of time
series. New notions of mean values based on ideas of multicriteria optimization are
compared to their conventional definitions leading to fresh algorithmic approaches.
To summarize, the book presents methods for automated analysis of patterns
and models for data of different nature with applications ranging from scientific
discovery to business intelligence and analytics. The style of the written chapters
allows to recommend this book for senior undergraduate and graduate data mining
courses providing a broad yet in-depth review integrating novel concepts from
machine learning and statistics. The main parts of the book include exploratory data
analysis, pattern mining, clustering, and classification supported by real life case
studies.
Students and professionals specializing in computer and management science,
data mining for high-dimensional data, complex graphs, and networks will benefit
from many cutting-edge ideas and practically motivated case studies.
From a technical point of view, each paper has received at least two referee
reports, and almost all papers are presented at the International conference “Data
Analysis, Optimization and Their Applications” on the occasion of Boris Mirkin’s
80th birthday, January, 30–31, 2023. Dolgoprudny, Moscow Region, Moscow
Institute of Physics and Technology.
With the purpose to keep the atmosphere of that conference, we have included
the Book of Abstracts and its schedule. We would like to thank all speakers and

v
vi Preface

contributors to the conference and volume dedicated to Boris Mirkin’s 80th birthday.
Without the crucial support on organizational issues, the conference and the book
could not have become a reality. Here is the list of our supporters and reviewers:
Fuad Aleskerov, NRU HSE, Russia
Daniel Berend, Ben Gurion University, Israel
Jean Diatta, University of la Réunion, France
Trevor Fenner, University of London, UK
Sergei Frenkel, FRC Computer Science and Control, Russia
Alexander Gasnikov, MIPT, Russia
Boris Goldengorin, University of Florida, USA
Eugene Koonin, NIH/NLM/NCBI, USA
Sergei Kuznetsov, NRU HSE, Russia
Boris Kovalerchuk, Central Washington University, USA
Leonid Litinskii, ISA RAS, Russia
Vladimir Makarenkov, University of Quebec in Montreal, Canada
Magomed Malsagov, ISA RAS, Russia
Boris Mirkin, NRU HSE, Russia and UL London, UK
Susana Nascimento, New University of Lisbon, Portugal
Alexandre Orlov, Bauman Moscow State Technical University, Russia
Panos Pardalos, University of Florida, USA
Natalia Pavlova, MIPT, Russia
Andrei Raigorodskii, MIPT, Russia
Alexey Remizov, MIPT, Russia
Fred Roberts, Rutgers University, USA
Yuri Sidel’nikov, MAI, Russia
Konstantin Vorontsov, Lomonosov MSU, Russia

Tashkent, Uzbekistan Boris Goldengorin

Gainesville, FL, USA
Moscow, Russia Sergei Kuznetsov
Program and Abstract Book

International Conference “Data Analysis, Optimization

and Their Applications” on the Occasion of Boris Mirkin’s
80th Birthday

(January 30–31, 2023. Dolgoprudny, Moscow Region, Moscow Institute of Physics

and Technology)
https://mipt.ru/education/chairs/dm/conferences/data-analysis-optimization-
and-their-applications-2023.php
The conference is organized in honor of professor Boris Mirkin’s 80th birthday
to celebrate his contributions in the fields of Data Science that he introduced or
extensively explored: Anomalous Cluster, Bi-cluster, Categorical Factor Analysis,
Chain Order Partition, Complementary Partition Criterion, Core-Shell Cluster, Data
Recovery Approach, Distance Between Partitions, Federation Consensus Rule, Gen-
eralization in Taxonomy, Interval Order, Linear Stratification, Mapping Between
Evolutionary Trees, Minkowski Weighted Feature Clustering, Parsimonious Gene
History Reconstruction, Single Cluster Clustering, Structured Partition, Taxonomic
Rank of Results, Tri-clustering.

Organization Committee

Fuad Aleskerov, NRU HSE, Russia

Trevor Fenner, University of London, UK
Alexander Gasnikov, MIPT, Russia
Fred Roberts, Rutgers University, USA
Boris Goldengorin, University of Groningen, The Netherlands
Sergei Kuznetsov, NRU HSE,Russia, Co-Chair
Boris Kovalerchuk, Central Washington University, USA
Vladimir Makarenkov, University of Quebec in Montreal, Canada
Panos Pardalos, University of Florida, USA

vii
viii Program and Abstract Book

Andrei Raigorodskii, MIPT, Russia, Chair

Konstantin Vorontsov, Lomonosov MSU, Russia

Invited Speakers

Fuad Aleskerov, NRU HSE, Russia

Aleksandr Beznosikov, MIPT, Russia
Tendai Chikake, MIPT, Russia
Fred Roberts, Rutgers University, USA
Boris Goldengorin, University of Groningen, The Netherlands
Alexander Karpov, NRU HSE, Russia
Yakov Karandashev, RUDN, Russia
Eugene Koonin, National Center for Biotechnology Information, USA
Boris Kovalerchuk, Central Washington University, USA
Sergei Kuznetsov, NRU HSE,Russia
Alexander Lepskiy, NRU HSE, Russia
Guy Leshem. Ashkelon Academic College, Israel
Vladimir Makarenkov, University of Quebec in Montreal, Canada
Irina Maximova, RUDN, Russia
Boris Mirkin, NRU HSE, Russia and UL London, UK
Susana Nascimento, New University of Lisbon, Portugal
E. Dov Neimand, Stevens Institute of Technology, USA
Maria Pilgun, Russian State Social University, Russia
Maria Poptsova, NRU HSE, Russia
Alexander Rubchinsky, NRU HSE, Russia
Alexey Samosyuk, MIPT, Russia
Soroosh Shalileh, NRU HSE, Russia
Zina Taran, Delta State University, USA
Yuliya A. Veselova, NRU HSE, Russia
Konstantin Vorontsov, Lomonosov MSU, Russia
Program and Abstract Book ix

Schedule

Monday, 30 January 2023

Time Speaker, Table of Contents

10:30–10:55 Conference Opening
11:00–11:25 Sergei Kuznetsov
11:30–11:55 Boris Mirkin
12:00–12:25 Soroosh Shalileh
Coffee break
13:00–13:25 Guy Leshem
13:30–13:55 Yakov Karandashev
14:00–14:25 Maria Pilgun
Lunch
15:00–15:25 Konstantin Vorontsov
15:30–15:55 Maria Poptsova
16:0016:25 Fred Roberts
Coffee break
17:00–17:25 Eugene Koonin
17:30–17:55 Boris Goldengorin
18:00–18:25 Vladimir Makarenkov
19:00 Conference Dinner, https://theorybar.ru/

Tuesday, 31 January 2023

Time Speaker
11:00–11:25 Fuad Aleskerov
11:30–11:55 Alexander Lepskiy
12:00–12:25 Irina Maximova
Coffee break
13:00–13:25 Yuliya A Veselova
13:30–13:55 Tendai Chikake
14:00–14:25 Alexey Samosyuk
Lunch
15:00–15:25 Alexander Rubchinsky
15:30–15:55 Susana Nascimento
16:00–16:25 Dmitry Frolov
Coffee break
17:00–17:25 Aleksandr Beznosikov
17:30–17:55 E Dov Neimand
18:00–18:25 Boris Kovalerchuk
18:30–18:55 Alexander Karpov
x Program and Abstract Book

Biclustering, n-Clustering and Formal Concept Analysis

Authors Sergei Kuznetsov

Abstract The term bi-clustering was coined by Boris Mirkin to denote grouping objects
based not on a similarity or distance notion, but on commonality of shared
attributes, so that two types of entities – objects and attributes – are grouped at
the same type. Formal Concept Analysis, based on Galois connections and
lattice of closed sets, proposes a model of a strict bi-cluster – called (formal)
concept – where a subset of objects shares all attributes from a subset of binary
attributes. The order on concepts is closely related to implicational
dependencies, both of exact (like functional dependencies) and approximate
(liked association rules) nature. Several well-known types of bi-clusters can be
efficiently reduced to concepts, thus taking advantage of machinery developed
for processing concepts. We consider some useful relaxations of concepts and
their generalization to multidimensional and unstructured data.
Affiliation National Research University Higher School of Economics, Moscow, Russian
Federation
Contact [email protected]
Keywords Bi-cluster, n-cluster, formal concept, lattice of closed sets, dependency,
association rule

Clustering as Empirical Classification

Authors Boris Mirkin

Abstract Clustering can be considered within various frameworks: machine learning, data
science, systems analysis. We consider it part of the science of classification
established by Aristotle a while ago. Unfortunately, the science of classification
is not well developed yet, although one may distinguish among its goals and
structures. Classification goals include: domain structuring, relating different
aspects of phenomena, and knowledge representation. Classification structures
included those faceted, ranking, partition, typology, and hierarchy. We analyze
the current state of clustering research with respect to classification structures
and goals.
Affiliation NRU HSE, Moscow, RU and UL London, UK,
Contact [email protected]
Keywords Clustering, Classification, Clustering goals
Program and Abstract Book xi

Classification Using Marginalized Maximum Likelihood

Estimation and Black-Box Variational Inference

Authors Soroosh Shalileh

Abstract Based upon variational inference (VI), a new set of
classification algorithms has recently emerged. This set of
algorithms aims (A) to increase generalization power; (B)
to decrease computational complexity. However, the
complex math and implementation considerations have led
to the emergence of black-box variational inference
methods (BBVI). Relying on these principles, we assume
the existence of a set of latent variables during the
generation of data points. We subsequently marginalize the
conventional maximum likelihood objective function w.r.t
this set of latent variables and then apply black-box
variational inference to estimate the model’s parameters.
We evaluate the performance of the proposed method by
comparing the results obtained from the application of our
method to realworld and synthetic data sets with those
obtained using basic and state-ofart classification
algorithms. We proceed and scrutinize the impact: (1) the
existence of non-informative features at various
dimensionalities, (2) the imbalanced data representation,
(3) non-linear data sets, and (4) different data set size on
the performance of algorithms under consideration. The
results obtained prove to be encouraging and effective.
Affiliation Center for Language and Brain, NRU HSE University,
Moscow, Russian Federation
Contact [email protected]
Keywords Automatic differentiation, Classification with Black Box 25
Variational Inference, Variational Inference, Classification
xii Program and Abstract Book

Purifying Data by Machine Learning with Certainty Levels

Authors Shlomi Dolev, Guy Leshem

Abstract For autonomic computing, self-managing systems, and
decision-making under uncertainty and faults, in many cases we
are using machine learning models and combine them to solve
any problem. This models uses a dataset, or a set of data items,
and data item is a vector of feature values and a classification. In
many cases, these data sets include outlier and/or misleading data
items that were created by input device malfunctions or were
maliciously inserted to lead the machine learning to wrong
conclusions. A reliable machine learning model must be able to
handle a corrupted data set; otherwise, a malfunctioning input
device that corrupts a portion of the data set or malicious
adversary may lead to inaccurate classifications. Therefore, the
challenge is to find an effective method to evaluate and increase
the certainty level of the learning process as much as possible.
This work introduces the use of a certainty level measure to
obtain better classification capability in the presence of corrupted
or malicious data items. Assuming we know the data distribution,
e.g., is a normal distribution (which is a reasonable assumption in
a large amount of data items) and/or a known upper bound on the
given number of corrupted data items, our techniques define a
certainty level for classifications. Another approach that will be
presented in this work suggests enhancing the random forest
techniques (the original model was developed by Leo Breiman)
to cope with corrupted data items by augmenting the certainty
level for the classification obtained in each leaf in the forest. This
method is of independent interest that of significantly improving
the classification of the random forest machine learning
technique in less severe settings.
Affiliation Dolev Shlomi, Ben-Gurion University of the Negev, Israel;
Department of Computer Science Ashkelon Academic College,
Israel
Contact [email protected], [email protected]
Keywords Data corruption, PAC learning, Machine learning, Certainty level
Program and Abstract Book xiii

Anomaly Detection with Neural Network Using a Generator

Authors A.S. Markov, E.Yu. Kotlyarov, N.P. Anosova, V.A. Popov, Yakov
Karandashev, D.E. Apushkinskaya
Abstract This paper concerns with the problem of detecting anomalies on X-ray images
taken by full-body scanners (FBS). Our previous work describes the sequence of
image preprocessing methods used to convert the original images, which are
produced with FBS, to images with visually distinguishable anomalies. In this
paper, we focus on development of the proposed methods, including the
addition of preprocessing methods and the creation of generator which can
produce synthetic anomalies. Examples of processed images are given. The
results of using a neural network for anomaly detection are shown.
Affiliation Peoples’ Friendship University of Russia (RUDN University), Moscow, Russian
Federation
Contact [email protected],[email protected],[email protected],
[email protected],[email protected],[email protected]
Keywords Full-body scanner, X-ray image, anomaly detection, image histogram
equalization, generator, neural network, U-Net

Data and Text Interpretation in Social Media:

Urban Planning Conflicts

Authors Maria Pilgun, Nailia Gabdrakhmanova

Abstract The relevance of this study is determined by the need to develop technologies
for effective urban systems management and resolution of urban planning
conflicts. The paper presents an algorithm for analyzing urban planning
conflicts on the example of data and text interpretation in social media. The
material for the study was data from social networks, microblogging, blogs,
instant messaging, forums, reviews, video hosting services, thematic portals,
online media, print media, and TV related to the construction of the Big circle
metro line (Southern section) in Moscow (Russian Federation). Data collection:
1 October 2020–10 June 2021. Number of tokens: 62 657 289. To analyze the
content of social media, a multi-modal approach was used. The paper presents
the results of research on the development of methods and approaches for
constructing mathematical and neural network models for analyzing the social
media users’ perceptions based on the user generated content and on digital
footprints of users. Artificial neural networks, differential equations, and
mathematical statistics were involved in building the models. Differential
equations of dynamic systems were based on observations enabled by machine
learning. In combination with mathematical and neural network model, the
developed approaches made it possible to draw a conclusion about the tense
situation, identify complaints of residents to constructors and city authorities,
and propose recommendations to resolve and prevent conflicts.
Affiliation Russian State Social University, Moscow, Russian Federation, Peoples’
Friendship University of Russia, Moscow, Russia, Russian Federation
Contact [email protected],[email protected]
Keywords Time series, neural networks, stochastic process, differential equation, urban
environment, social tension, social media
xiv Program and Abstract Book

Rethinking Probabilistic Topic Modeling from the Point

of View of Classical Non-Bayesian Regularization

Authors Konstantin Vorontsov

Abstract Probabilistic topic modeling with hundreds of its models and
applications has been an efficient text analysis technique for almost
twenty years. This research area has evolved mostly within the
frame of the Bayesian learning theory. For a long time, the
possibility of learning topic models with a simpler conventional
(non-Bayesian) regularization remained underestimated and rarely
used. The framework of additive regularization for topic modeling
(ARTM) fills this gap. It dramatically simplifies the model inference
and opens up new possibilities for combining topic models by just
adding their regularizers. This makes the ARTM a tool for
synthesizing models with desired properties and gives rise to
developing the fast online algorithms in the BigARTM open-source
environment equipped with a modular extensible library of
regularizers. In this paper, a general iterative process is proposed
that maximizes a smooth function on unit simplices. This process
can be used as inference mechanism for a wide variety of topic
models. This approach is believed to be useful not only for
rethinking probabilistic topic modeling, but also for building the
neural topic models increasingly popular in recent years.
Affiliation Federal Research Center “Computer Science and Control” of RAS,
M.V.Lomonosov Moscow State University (MSU), Moscow,
Russian Federation
Contact [email protected]
Keywords Probabilistic Topic Modeling, Additive Regularization of Topic
Models, EM-algorithm, BigARTM, Multimodal Topic Modeling,
Hierarchical Topic Modeling, Hypergraph Topic Modeling,
Sequential Topic Modeling, Topical Embedding, Transactional
Data, Recommender Systems, Latent Dirichlet Allocation, Bayesian
Learning.
Program and Abstract Book xv

Generating Genomic Maps of Z-DNA with the Transformer

Algorithm

Authors Dmitry Umerenkov, Vladimir Kokh, Alan Herbert, Maria Poptsova

Abstract Z-DNA and Z-RNA were shown to play an important role in various processes
of genome functioning acting as flipons that launch or suppress genetic
programs. Genome-wide experimental detection of Z-DNA remains a challenge
due to dynamic nature of its formation. Recently we developed a deep learning
approach DeepZ, based on CNN and RNN architectures, that predicts Z-DNA
regions using additional information from omics data collected from different
cell types. Here we took advantage of the transformer algorithm that trains
attention maps to improve classifier performance. We started with pretrained
DNABERT models and fine-tuned their performance by training with
experimental Z-DNA regions from mouse and human genome wide studies. The
resulting DNABERT-Z outperformed DeepZ. We demonstrated that
DNABERT-Z finetuned on human data sets also generalizes to predict Z-DNA
sites in mouse genome.
Affiliation Sber Artificial Intelligence Lab, Moscow, Russian Federation Laboratory of
Bioinformatics, Faculty of Computer Science, HSE University, Moscow,
Russian Federation, InsideOutBio, Charlestown, MA, USA
Contact [email protected]
Keywords non-B DNA structures, machine learning, deep learning, transformer, Z-DNA

Graph-Theoretical Models of the Spread and Control

of Disease and of Fighting Fires

Authors Fred S. Roberts

Abstract We will describe irreversible threshold processes on graphs that model the
spread of disease and lead to insights about strategies for vaccination,
quarantine, etc.; we will describe models of the control of fires that are
mathematically analogous to the disease spread models. The analogy will lead
to insights about both types of processes and a variety of challenging
graph-theoretical problems.
Affiliation DIMACS, Rutgers University Piscataway, NJ USA
Contact [email protected]
Keywords Graph-theoretical models, Disease spread models, Vaccination models,
Threshold processes, The firefighter problem
xvi Program and Abstract Book

The Last Universal Cellular Ancestor: What Have We

Learned After 20 Years of Effort?

Authors Eugene Koonin

Abstract In 2003, Boris Mirkin and colleagues published a seminal on the
Last Universal Cellular Ancestor (LUCA). In this work, a
modified maximum parsimony approach was developed and
applied to reconstruct the genome of the LUCA from a mapping
of genes that are conserved across different ranges of extant
organism on the universal phylogenetic tree. The size and
composition of the reconstructed gene sets of the LUCA
critically depended on the key parameter, namely, the gain
penalty, or the ratio of the rates of gene gain, via emergence of
new genes and horizontal gene transfer, to the rate of gene loss.
As it can be expected, the size of the reconstructed gene sets grew
with increasing gain penalty (g) such that the number of genes
mapped to the LUCA varied from less than 600 for g=1 (gene
gains considered as frequent as gene gains) to about 1700 for
g=10 (gains considered 10 times less frequent than losses).
During the 20 years elapsed since the publication of this work,
several attempts to reconstruct the gene set of the LUCA using
more sophisticated algorithms and many more genomes of
prokaryotes have been undertaken, but the results of Mirkin and
colleagues do not appear to have been superseded. At the time of
the original publication, the high rate of horizontal gene transfer
in bacteria and archaea had been just discovered, and a low gain
penalty, leading to a simple LUCA, appeared most plausible.
Subsequently, however a confluence of biological considerations
was increasingly pointing towards a LUCA that was comparable
in complexity to modern bacteria and archaea. The
reconstructions obtained by Mirkin and colleagues in 2003 with
the largest gain penalties might have been surprisingly accurate,
at least in terms of the number of genes in the LUCA genome.
Affiliation National Center for Biotechnology Information, National Library
of Medicine, National Institutes of Health, Bethesda, MD, USA
Contact [email protected]
Keywords Last Universal Cellular Ancestor, Phylogenetic Trees, Maximum
Parsimony, Gene Gain, Gene Loss
Program and Abstract Book xvii

Pseudo-Boolean Polynomials, Dilworth Theorem and Data

Aggregation

Authors Boris Goldengorin

Abstract In this talk, we use a pseudo-Boolean representation of the given
matrix (two-dimensional table) with the purpose to aggregate
(compress) its columns preserving their monotone behavior [2]. A
natural application of Dilworth theorem (1950) [1] to the simplified
polynomial returns the mi-nimum number of aggregated columns.
Further, matrix compactification can be done by truncation columns
depending on the number of rows fairly representing the original
data [3]. We show that truncated pseudo-Boolean polynomials are
invariants to represent clusters of equivalent data (matrices). The
presented approach might be applied to 3- and multi-dimensional
data sets. Promising preliminary experiments with the
4-dimensional Iris Flower dataset, 30-dimensional Wisconsin
diagnostic breast cancer (WDBC), edge/blob detection, contour
analysis, image segmentation and optical character recognition
(OCR) are discussed in the talk of Tendai Chikake [4].
References
1. Robert P. Dilworth. A Decomposition Theorem for Partially
Ordered Sets, Annals of Mathematics, 1950, 51 (1): 161–166,
doi:10.2307/1969503.
2. B. F. AlBdaiwi, D. Ghosh, B. Goldengorin. Data Aggregation for
p-Median Problems. Journal of Combinatorial Optimization, 2011,
3(21), 348-363.
3. B.Goldengorin, D.Krushinsky, P.M. Pardalos. Cell Formation in
Industrial Engineering. Theory and Computational Experiments.
Springer, 2013 (chapter 2).
4. B.Goldengorin, T. Chikake. Pseudo-Boolean polynomials (pBp)
for dimensionality reduction and image processing. Book of
Abstracts. International conference “Data Analysis, Optimization
and their Applications” on the occasion of Boris Mirkin’s 80th
birthday. MIPT, 2023. Page 23
Affiliation University of Florida, USA; University of Groningen, NL and
MIPT, Russian Federation
Contact [email protected]
Keywords Pseudo-Boolean Polynomials, Monotonicity, Data Aggregation,
Dilworth Theorem
xviii Program and Abstract Book

Inferring Multiple Consensus Trees and Super-Trees Using

Clustering: A Review

Authors Vladimir Makarenkov, Gayane S Barseghyan, Nadia Tahiri

Abstract Phylogenetic trees (i.e. evolutionary trees, additive trees or
X-trees) play a
key role in the processes of modeling and representing species
evolution. Genome evolution of a given group of species is
usually modeled by a species phylogenetic tree that represents
the main patterns of vertical descent. However, the evolution of
each gene is unique. It can be represented by its own gene tree
which can differ substantially from a general species tree
representation. Consensus trees and supertrees have been widely
used in evolutionary studies to combine phylogenetic information
contained in individual gene trees. Nevertheless, if the available
gene trees are quite different, then the resulting consensus tree or
supertree can either include many unresolved subtrees
corresponding to internal nodes of high degree or can simply be a
star tree. This may happen if the available gene trees have been
affected by different reticulate evolutionary events, such as
horizontal gene transfer, hybridization or genetic recombination.
In this case, the problem of inferring multiple alternative
consensus trees or supertrees, using clustering, becomes relevant
since it allows one to regroup in different cluster gene trees
having similar evolutionary patterns (e.g. gene trees representing
genes that have undergone the same horizontal gene transfer or
recombination events). We critically review recent advances and
methods in the field of phylogenetic tree clustering, discuss the
methods’ mathematical properties and describe the advantages
and limitations of multiple consensus trees and supertree
approaches. In the application section, we show how the
discussed supertree clustering approach can be used to cluster
aaRS evolutionary trees according to their evolutionary patterns.
Affiliation Département d’Informatique, Université du Québec à Montréal,
Case postale 8888, Succursale Centre-ville, Montreal, QC, H3C
3P8, Canada
Département d’Informatique, Université de Sherbrooke, 2500
Boulevard de l’Université, Sherbrooke, Québec J1K 2R1, Canada
Contact [email protected]
Keywords Clustering, Cluster validity index, Consensus tree, k-means,
k-medoids, Phylogenetic tree, Robinson and Foulds distance,
Supertree
Program and Abstract Book xix

The Model of Tunnel Clustering and Its Applications

Authors Fuad Aleskerov, Darya Chubarova , Vyacheslav Yakuba

Abstract We propose a man-machine procedure for dynamic pattern analysis.
This very model allows us to analyze the changes of clusters over
time of objects of different nature based on ε-tube defined on the
parameters of objects. We present several examples of a dynamic
pattern-analysis; in particular, we consider the oil export/import
operations among countries in 2005–2020.
Affiliation HSE University, Moscow, Russian Federation, Institute of Control
Sciences of Russian Academy of Sciences, Moscow, Russian
Federation, HSE University, Moscow, Russia, Institute of Control
Sciences of Russian Academy of Sciences and HSE University,
Moscow, Russian Federation
Contact [email protected],[email protected],[email protected]
Keywords Pattern-analysis, Clustering, Dynamic pattern-analysis, Oil
export/import

About Some Clustering Algorithms in Evidence Theory

Authors Alexander Lepskiy

Abstract The Dempster–Shafer theory of evidence considers data that have a
frequency-set nature (the so-called body of evidence). In recent years, there has
been interest in clustering such objects to approximate them with simpler bodies
of evidence, to analyze the inconsistency of information, reducing the
computational complexity of processing algorithms, revealing the structure of
the set of focal elements, etc. The article discusses some existing algorithms for
clustering evidence bodies and suggests some new algorithms and approaches in
such clustering.
Affiliation National Research University Higher School of Economics (HSE), Moscow,
Russian Federation
Contact [email protected]
Keywords Evidence Theory, Conflict Measure, Clustering Bodies of Evidence
xx Program and Abstract Book

Controllability of Triangular Systems with Phase Space

Change

Authors Irina Maximova

Abstract In the present paper, the controllability of a composite system of the following
structure is investigated: two phase spaces and two consecutive time intervals
are given, in each space on the corresponding time interval the motion of the
object is described by a nonlinear system. The phase spaces are changed with
the help of some given mapping, and the docking of trajectories is also
connected with it. The conditions of controllability of this system from the
initial set of one space to the finite set of another space are obtained in the paper.
An approach to finding trajectories for this motion is proposed.
Affiliation S. M. Nikolsky Mathematical Institute, Peoples’ friendship university of Russia
(RUDN University), 6 Miklukho-Maklaya St, Moscow, 117198, Russian
Federation
Contact [email protected]
Keywords Controllability, full controllability, phase space change, triangular system.

Manipulation by Coalitions in Voting with Incomplete

Information

Authors Yuliya A Veselova

Abstract We consider the problem of coalitional manipulation in collective decision
making and a probabilistic approach for solving it. We assume that voters have
some information about other voters’ preferences from opinion polls held before
voting. There are five different types of poll information functions. Coalition
members are assumed to have identical preferences. We consider the probability
that in a randomly chosen preference profile there exists a coalition which has
an incentive to manipulate under a given type of poll information. We answer
the following questions. How does coalitional manipulability differ from
individual? How do different types of poll information affect coalitional
manipulability? We answer these questions via both theoretical investigation
and computational experiments.
Affiliation National Research University Higher School of Economics (HSE University),
Moscow, Russian Federation, Institute of Control Sciences of Russian Academy
of Sciences, Moscow, Russian Federation
Contact [email protected]
Keywords Coalitions, manipulation, manipulability index, opinion poll, incomplete
information
Program and Abstract Book xxi

Pseudo-Boolean Polynomials (pBp) for Dimensionality

Reduction and Image Processing

Authors Tendai Chikake, Boris Goldengorin

Abstract We introduce usage of a reduction property of penalty-based formulation of
pseudo-Boolean polynomials as a mechanism for invariant dimensionality
reduction in cluster analysis processes. In our experiments, we show that
multidimensional data, like a 4-dimensional Iris Flower dataset, can be reduced
to 2-dimensional space while a 30-dimensional Wisconsin Diagnostic Breast
Cancer (WDBC) dataset can be reduced to 3-dimensional space, and through
linear edge searches we can extract clusters in a linear and unbiased manner
with competitive accuracies, reproducibility, and clear interpretation.
We further showcase the exploitation of equivalence and polynomial degree for
detecting gradient change in image data represented in greyscale matrices.
These properties enable us to propose new methods of edge/blob detection,
contour analysis, image segmentation and optical character recognition (OCR)
using combinatorial techniques.
Affiliation Department of Discrete Mathematics, Phystech School of Applied Mathematics
and Informatics, Moscow Institute of Physics and Technology (MIPT)
Dolgoprudny, Moscow Region, Russian Federation
Contact [email protected], [email protected]
Keywords Pseudo-Boolean polynomials, equivalence, dimensionality reduction, cluster
analysis, feature selection, outlier detection, interpretable clustering, edge/blob
detection, contour analysis, image segmentation, OCR

Clicks, Random Graphs and Neural Networks

Authors Alexey Samosyuk, Shokorov Viacheslav

Abstract In this work, we present several computational experiments that demonstrate
applications the Erdős-Rényi graph model and the Johnson–Lindenstrauss
lemma to the evolution of neural networks during training and fine-tuning on the
academia- and industry-level datasets (MS1M, WebFace42) and models
(ShuffleNetv2, ResNet200). We show this on the 2 million classes FaceID
problem, followed by the domain adaptation step.
Affiliation Moscow Institute of Physics and Technology (MIPT) Dolgoprudny, Moscow
Region, Russian Federation
Contact [email protected], [email protected]
Keywords Neural Networks, Explainable, Random Graphs, Domain Adaptation
xxii Program and Abstract Book

Algorithm of Trading on the Stock Market, Providing

Satisfactory Results

Authors Alexander Rubchinsky, Kristina Baikova

Abstract The paper proposes a new trading algorithm for S&P-500 stock market, which
provides positive results over a sufficiently long period of time. No assumptions
are made about the behave or of this market (including probabilistic ones), and
no predictions are made and used. The daily real stock price data are considered,
and the gain (or loss) that would be obtained if the proposed stock choice
algorithm for the next day was applied is calculated. The algorithm uses only
the closing price data for the preceding days and includes a special stopping rule
based on the income, accumulated since an initial day of a considered period till
a current day. The suggested algorithm substantially uses the previously
developed approach to construction a family of graph decompositions (see
publications [1−3]).
Affiliation National Research University HSE, Moscow, Russian Federation, National
University of Science and Technology (MISIS), Moscow, Russian Federation
Contact [email protected], [email protected]
Keywords Stock market, graph decomposition, trading algorithm, cluster, stopping rule,
secretary problem, fractal
Program and Abstract Book xxiii

Three-Stage Cluster Modeling for the Spatiotemporal

Analysis of Coastal Upwelling

√
Authors Susana Nascimento, Alexandre Martins, Paulo Relvas, Joaquim F. Lu =s,
Boris Mirkin
Abstract This work proposes a three-stage spatiotemporal clustering approach for the
automatic recognition and analysis of coastal upwelling from Sea Surface
Temperature (SST) grids derived from satellite images.
The algorithm, core-shell clustering, models the upwelling as an evolving
cluster whose core points are constant during a certain time window while the
shell points move through an in-and-out binary sequence. The least squares
minimization of clustering criterion allows to derive key parameters in an
automated way.
The algorithm is initialized with an extension of Seeded Region Growing
offering self-tuning thresholding, the STSEC algorithm, that is able to precisely
delineate the upwelling regions at each SST instant grid. Yet, the application of
STSEC to the SST grids as temporal data puts the business of finding relatively
stable “time windows”, here called “time ranges”, for obtaining the core clusters
onto an automated footing.
The approach successfully applies to SST image data for sixteen successive
years of coastal upwelling of the Canary Current Upwelling System, covering
two distinct regions: the Portuguese coast and the Morocco coast. The extracted
time series of upwelling features presented consistent regularities among the
upwelling seasons.
Affiliation Dep. of Computer Science and NOVA Laboratory for Computer Science and
Informatics (NOVA-LINCS) Faculdade de Ciências e Tecnologia, Universidade
Nova de Lisboa, Lisbon, Portugal, Centre of Marine Sciences (CCMAR),
Universidade do Algarve, Faro, Portugal, Instituto Dom Luís, Universidade de
Lisboa, Lisboa Portugal, and Universidade do Algarve,Faro, Portugal
Department of Data Analysis and AI, Higher School of Economics, Moscow,
Russia, Department of Computer Science, Birkbeck University of London,
London, UK)
Contact [email protected]
Keywords Spatiotemporal clustering, Time series segmentation, Data recovery clustering,
SST images, Coastal upwelling
xxiv Program and Abstract Book

A Three-Step Method for Audience Extension in Internet

Advertising using an Industrial Taxonomy

Authors Dmitry Frolov, Zina Taran

Abstract The paper addresses a very common problem in targeted digital
advertising, insufficient audience size. Many approaches to audience
extension frequently lead to much diminishing quality metrics, such
as audience quality or conversion rates. This is the case, for
example, for so-called lookalike techniques. We present a novel
method for the efficient extension of target audiences. Our base is a
popular taxonomy of user interests, the IAB contents taxonomy,
combined with the representation of browsing behavior of millions
of users by fuzzy sets of visited IAB taxonomy segments, that are
leaves of the taxonomy tree. We use this idea in our method. The
method consists of three steps: (1) computing membership values
for the interest segments for a user by a classifier; (2) performing
generalization of those sets and obtaining highranked segments,
which is a core part of the method; (3) obtaining a set of advertising
campaigns for a user. Our method involves an algorithm for
optimally lifting individual fuzzy leaf sets into a higher rank
taxonomy node, a so-called “head subject”. The head subject must
cover the input fuzzy leaf set in such a way that the number of errors
is minimized. This algorithm was proposed as an intelligent
information retrieval tool. It can be applied, however, to a very
different task of targeted advertisement. To extend the audiences of
a targeted advertisement, we find their head subjects off-line. Given
a set of taxonomy segments corresponding to targeted audiences, we
include a user as a target if her head subject covers any of those
segments. This lifting-based step does increase the number of
successful matches between user segments and campaign segments
two- or three-fold without losing in the targeting quality, the
click-through rate, CTR. This is in stark contrast to the conventional
look-alike methods for increasing the audience numbers by reducing
the admissibility thresholds, which leads to a large decrease in CTR,
which was experimentally proven.
Affiliation Department of Data Analysis and Artificial Intelligence, HSE
University, Moscow, Russian Federation, Division of Management,
Marketing, and Business Administration, Delta State University,
Cleveland, MS, USA
Contact [email protected], [email protected]
Keywords Targeted advertising, Audience extension, Taxonomy, Fuzzy leaf
set, Optimal lifting
Program and Abstract Book xxv

SARAH-Based Variance-Reduced Algorithm for Stochastic

Finite-Sum Cocoercive Variational Inequalities

Authors Aleksandr Beznosikov, Alexander Gasnikov

Abstract Variational inequalities are a broad formalism that encompasses a vast number
of applications. Motivated by applications in machine learning and beyond,
stochastic methods are of great importance. In this paper, we consider the
problem of stochastic finite-sum cocoercive variational inequalities. For this
class of problems, we investigate the convergence of the method based on the
SARAH variance reduction technique. We show that for strongly monotone
problems it is possible to achieve linear convergence to a solution using this
method. Experiments confirm the importance and practical applicability of our
approach.
Affiliation Moscow Institute of Physics and Technology (MIPT), Dolgoprudny, Moscow
region, Russian Federation; HSE University, Moscow, Russian Federation, IITP
RAS, Moscow, Russian Federation, Caucasus Mathematical Center, Adyghe
State University, Maikop, Russian Federation
Contact [email protected], [email protected]
Keywords Stochastic optimization, variational inequalities, finite sum problems

A Parallel Linear Active Set Method

Authors E. Dov Neimand, Şerban Sabău

Abstract Given a linear-inequality-constrained convex minimization problem in a Hilbert
space, we develop a novel binary test that examines sets of constraints and
passes only active-constraint sets. The test employs a blackbox,
linear-equality-constrained convex minimization method but can often fast fail,
without calling the black-box method, by considering information from
previous applications of the test on subsets of the current constraint set. This fast
fail, as a function of the number of dimensions, has quadratic complexity and
can be completely multi-threaded down to near-constant complexity. Only when
the test is unable to fast fail, does it use the blackbox method. In both cases, the
test generates the optimal point over the subject inequalities. Iterative and
largely parallel applications of the test over growing subsets of inequality
constraints yields a minimization algorithm. We also include an adaptation of
the algorithm for a non-convex polyhedron in Euclidean space. Outside of
calling the black-box method, complexity is not a function of accuracy. The
algorithm does not require the feasible space to have a non-empty interior, or
even be nonempty. With ample threads, the multi-threaded complexity of the
algorithm is constant as a function of the number of inequalities.
Affiliation Department of Electrical and Computer Engineering, Stevens Institute of
Technology
Contact [email protected], [email protected]
Keywords Convex Optimization, Non-Convex Polyhedron, Hilbert Space, Strict Convexity,
Parallel Optimization
xxvi Program and Abstract Book

Visual Explanable Machine Learning for High-Stake

Decision-Making with Worst Case Estimates

Authors Charles Recaido, Boris Kovalerchuk

Abstract A major motivation for explaining and rigorous evaluating Machine Learning
(ML) models is coming from high-stake decision-making tasks like cancer
diagnostics, self-driving cars, and others with possible catastrophic
consequences of wrong decisions. This paper shows that visual knowledge
discovery (VKD) methods, based on the General Line Coordinates (GLC)
recently developed, can significantly contribute to solving this problem. The
concept of hyperblocks (n-D rectangles) as interpretable dataset units and GLC
are combined to create visual selfservice machine learning models. Two variants
of Dynamic Scaffold Coordinates (DSC) are proposed. It allows losslessly
mapping high-dimensional datasets to a single two-dimensional Cartesian plane
and building interactively an ML predictive model in this 2-D visualization
space. Major benefits of DSC1 and DSC2 are their highly interpretable nature.
They allow domain experts to control or establish new machine learning models
through visual pattern discovery. It opens a visually appealing opportunity for
domain experts, who are not ML experts, to build ML models as a self-service
bringing the domain expertise to the model discovery, which increases model
explainability and trust for the end user. DSC were used to find, visualize, and
estimate the worst-case validation splits in several benchmark datasets, which is
important for high-risk application. For large datasets, DSC is combined with
dimensionality reduction techniques such as principal component analysis,
singular value decomposition, and t-distributed stochastic neighbor embedding.
A software package referred to as Dynamic Scaffold Coordinates
Visualization System (DSCViz) was created to showcase the DSC1 and DSC2
systems.
Affiliation Department of Computer Science, Central Washington University, USA
Contact [email protected], [email protected]
Keywords Explainable machine learning, Visualization, Hyperblock, Multidimensional
coordinate system, Self-service model.

Preferences over Mixed Manna

Authors Alexander Karpov

Abstract We define a new class of Condorcet domains, which are called GF-domains.
GF-domains are unique Condorcet domains that are weakly minimally rich,
semi-connected and contain a pair of mutually reverse preference orders.
GF-domains are single-peaked on a circle that leads to a clear interpretation.
Affiliation HSE University, Moscow, Russian Federation, Institute of Control Sciences
Russian Academy of Sciences, Moscow, Russian Federation
Contact [email protected]
Keywords Restricted domains, majority voting, single-peaked on a circle, circular city model
Contents

Optimal Layered Defense For Site Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Tsvetan Asamov, Emre Yamangil, Endre Boros, Paul B. Kantor,
and Fred Roberts
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Mathematical Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3 Exhaustive Search Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4 Dynamic Programming Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5 Running Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 The Case of an Adaptive Adversary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
8 Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
9 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Developing and Running a Set of Competitive College/University
Blended Courses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Alexander S. Belenky
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 A Brief Review of Research Publications in Five Major Areas
Related to the Use of Blended Courses in the Education Process . . . . . . . . 26
3 The Statements of the Problems and Their Mathematical Formulation . . 31
4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
SARAH-Based Variance-Reduced Algorithm for Stochastic
Finite-Sum Cocoercive Variational Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Aleksandr Beznosikov and Alexander Gasnikov
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2 Problem Setup and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Main Part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

xxvii
xxviii Contents

Dimensionality Reduction Using Pseudo-Boolean Polynomials

for Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Tendai Mapungwana Chikake and Boris Goldengorin
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1 The Iris Flower Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 The Wisconsin Diagnostic Breast Cancer (WDBC) Dataset . . . . 66
5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 The Iris Flower Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 The Wisconsin Diagnostic Breast Cancer (WDBC) Dataset . . . . 69
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Pseudo-Boolean Polynomials Approach to Edge Detection and
Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Tendai Mapungwana Chikake, Boris Goldengorin, and Alexey Samosyuk
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Dubai Landscape Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Purifying Data by Machine Learning with Certainty Levels. . . . . . . . . . . . . . . . 89
Shlomi Dolev and Guy Leshem
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2 Addition of Byzantine Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3 Corruption of Existing Data, Single Feature Learning with a
Certainty Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4 Corruption of Existing Data, Multi-Feature Learning (with a
New Decision Trees Algorithm) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
On Impact of Data Models on Predictability Assessment of Time Series . . 103
Sergey Frenkel
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2 Description and Statement of the Prediction Problem . . . . . . . . . . . . . . . . . . . . 105
3 Conceptual Model of the Prediction Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.1 Requirement for Consistency of CM Components. . . . . . . . . . . . . . . 108
3.2 Generalized Representation of the Conceptual Model
of the Prediction Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Contents xxix

4 Requirements for Mathematical Models of Prediction . . . . . . . . . . . . . . . . . . . 112

4.1 About Binary Data Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2 Performance Criterion Not Related to Data Distributions . . . . . . . 116
4.3 Influence of Probabilistic Properties of Random Time
Series on Prediction Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.4 Self-Similarity as a Nonlinear Relationship Between
the Past and the Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.5 Stationarity as a Property of a Random Process
Affecting Predictability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.6 Influence of Probability Distributions of Processes
with LRD on Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7 Nonlinear Models and Neural Algorithms . . . . . . . . . . . . . . . . . . . . . . . 125
5 Properties of Probability Distributions as Characteristics
of the Predictability of the Sign of Increments of Random Time
Series and Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.1 Binary Prediction Model of Sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2 On the Connection Between the Properties
of Autocorrelation Functions of Random Processes
and the Probabilities of Changing Signs of Increments . . . . . . . . . . 129
5.3 Joint Use of SC Predictor and 1/0 Predictor . . . . . . . . . . . . . . . . . . . . . . 131
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A Three-Step Method for Audience Extension in Internet
Advertising Using an Industrial Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Dmitry Frolov and Zina Taran
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2 Programmatic Targeting in Internet Advertising . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3 Parsimoniously Generalization a Fuzzy Thematic Subset in Taxonomy. 137
4 A 3-Step Method, Developing and Lifting of User Segments Profiles . . . 141
5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
From Prebase in Automata Theory to Data Analysis: Boris
Mirkin’s Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Boris Goldengorin
1 Mirkin’s Prebase in Abstract Automata Theory: A Warm Up. . . . . . . . . . . . 147
2 Jump to Group Choice and Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3 Last Universal Cellular Ancestor (LUCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4 Who Is Boris Mirkin? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
xxx Contents

Manipulability of Aggregation Procedures for the Case of Large

Numbers of Voters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Alexander Ivanov
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
2 Main Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.1 Aggregation Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.2 Manipulation and Manipulability Index (NK-Index) . . . . . . . . . . . . 159
2.3 Extended Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
2.4 Impartial Culture and Computer Modeling Scheme . . . . . . . . . . . . . 161
2.5 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Preferences over Mixed Manna. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Alexander Karpov
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3 Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
About Some Clustering Algorithms in Evidence Theory . . . . . . . . . . . . . . . . . . . . 179
Alexander Lepskiy
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
2 Necessary Information from Evidence Theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3 Basic Approaches for Clustering Body of Evidence . . . . . . . . . . . . . . . . . . . . . . 181
3.1 Hierarchical Inner and Outer Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.2 Clustering Based on Conflict Density Distribution . . . . . . . . . . . . . . 183
3.3 Clustering Based on Conflict Optimization . . . . . . . . . . . . . . . . . . . . . . 185
4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Inferring Multiple Consensus Trees and Supertrees Using
Clustering: A Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Vladimir Makarenkov, Gayane S. Barseghyan, and Nadia Tahiri
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3 Cluster Validity Indices Adapted to Tree Clustering . . . . . . . . . . . . . . . . . . . . . . 202
3.1 Caliński-Harabasz Cluster Validity Index Adapted for
Tree Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
3.2 Ball-Hall Index Adapted for Tree Clustering. . . . . . . . . . . . . . . . . . . . . 204
3.3 Silhouette Index Adapted for Tree Clustering . . . . . . . . . . . . . . . . . . . . 205
3.4 Gap Statistic Adapted for Tree Clustering. . . . . . . . . . . . . . . . . . . . . . . . 206
4 Example of Application to Evolutionary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Contents xxxi

Anomaly Detection with Neural Network Using a Generator. . . . . . . . . . . . . . . 215

Alexander S. Markov, Evgeny Yu. Kotlyarov, Natalia P. Anosova,
Vladimir A. Popov, Yakov M. Karandashev, and Darya E. Apushkinskaya
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
2 Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.1 Overview of Existing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.2 Image Preprocessing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
3.3 Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.4 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4 Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Controllability of Triangular Systems with Phase Space Change . . . . . . . . . . 225
Irina Sergeevna Maximova
1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
2 Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
3 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
A Parallel Linear Active Set Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
E. Dov Neimand and Şerban Sabău
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
3 The Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
4 Polyhedral Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.1 Preliminary Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
4.2 The Necessary Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
4.3 The Sufficient Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
5 Algorithm Proofs and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.1 Proof of Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6 Non-Convex Polyhedra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Mean Values: A Multicriterial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Vladislav V. Podinovski and Andrey P. Nelyubin
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
2 Definition of Mean Values as Nondominated Points. . . . . . . . . . . . . . . . . . . . . . 258
3 Mean Values for Equally Important Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4 Mean Values for Equally Important Criteria Measured
on the First Ordered Metric Scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
5 On the Construction of Sets of Mean Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
xxxii Contents

6 On Comparing Multi-valued Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

7 On Stability of Pn-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
8 The Case of Data with Repetitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Data and Text Interpretation in Social Media: Urban Planning Conflicts. 271
Maria Pilgun and Nailia Gabdrakhmanova
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
2.1 Text Analysis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
2.2 Approaches to Mathematical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 274
3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
3.1 General Description of the Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
3.2 Neural Network Semantic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
3.3 Simulation Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Visual Explainable Machine Learning for High-Stakes
Decision-Making with Worst Case Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Charles Recaido and Boris Kovalerchuk
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
1.1 Motivation and Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
1.2 Forms of General Line Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
1.3 Related Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
2 Dynamic Scaffold Coordinates with Hyperblocks . . . . . . . . . . . . . . . . . . . . . . . . 300
2.1 Dynamic Scaffolding Coordinates Based on Parallel
Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
2.2 Dynamic Scaffolding Coordinates Based on Shifted
Paired Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
3 Methods and Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
3.1 Overview of Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
3.2 DT Hyperblocks with Iris Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
3.3 DT Hyperblocks with Seeds Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
3.4 DT Hyperblocks and Principal Components
with Wisconsin Breast Cancer Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
3.5 MNIST Handwritten Digits with Autoencoder
and T-SNE Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
3.6 Comparison of Worst and Average Validation Sets for
Different Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
4 DSCViz Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
A.1 Appendices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
A.1.1 Appendix A – Seeds Dataset Decision Tree . . . . . . . . . . . . . . . . . . . . . . 325
Contents xxxiii

A.1.2 Appendix B – Wisconsin Breast Cancer Dataset

Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Algorithm of Trading on the Stock Market, Providing
Satisfactory Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Alexander Rubchinsky and Kristina Baikova
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
2 Preliminary Description of the Suggested General Scheme . . . . . . . . . . . . . . 334
3 Detailed Description of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Appendix. Algorithm A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Classification Using Marginalized Maximum Likelihood
Estimation and Black-Box Variational Inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Soroosh Shalileh
1 Introduction: Background, Previous Works, and Motivation . . . . . . . . . . . . . 349
2 A Marginalized Likelihood Estimation Using Black Box
Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
3 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
3.1 Purposes of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
3.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
3.3 Competitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
4.1 Study the Impact of the Existence of Non-informative
Features vs. Fixed Number of Informative Features:
Balanced and Imbalanced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
4.2 Study the Performance of Algorithms over
Moon-Shape Data Set: Balanced and Imbalanced . . . . . . . . . . . . . . . 357
4.3 Experimental Results at Real-World Data Sets . . . . . . . . . . . . . . . . . . 358
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Generating Genomic Maps of Z-DNA with the Transformer Algorithm . . 363
Dmitry Umerenkov, Vladimir Kokh, Alan Herbert, and Maria Poptsova
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
2 Material and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
2.1 Z-DNA Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
2.2 Benchmark Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
2.3 Fine-Tuning DNABERT Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
3 Z-DNABERT: Fine-Tuning DNABERT for Z-DNA Prediction . . . . . . . . . 367
4 Z-DNABERT Cross-Species Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
5 Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
xxxiv Contents

Manipulation by Coalitions in Voting with Incomplete Information . . . . . . . 377

Yuliya A. Veselova
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
2.1 Definitions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
2.2 Poll Information Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
2.3 Individual Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
2.4 Coalitional Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
2.5 Social Choice Correspondences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
3 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
4 Computational Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Rethinking Probabilistic Topic Modeling from the Point of View
of Classical Non-Bayesian Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Konstantin Vorontsov
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
2 Maximization on Unit Simplices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
3 Probabilistic Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
4 Additive Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
5 Comparison with Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
6 Overview of Models and Regularizers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
7 Hypergraph Topic Models of Transactional Data . . . . . . . . . . . . . . . . . . . . . . . . . 408
8 Hypergraph Recommender Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
9 Sequential Text Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
About the Editors

Boris Goldengorin is the author and inventor of data correcting and tolerance-
based algorithms applied to many problems in operations research, supply chain
management, quantitative logistics, industrial engineering, data and stock market
analysis. Boris is the author of more than 100 articles published in leading
international journals, including the Journal of Algebraic Combinatorics, Dis-
crete Optimization, Journal of Combinatorial Optimization, Journal of Global
Optimization, Operations Research, Management Science, European Journal of
Operational Research, Journal of Operational Research Society, Mathematical
and Computer Modelling, Computers and Operations Research, Computers and
Industrial Engineering, Expert Systems with Applications, Journal of Heuristics,
Optimization Methods and Software, Computational Management Science, and
many others. Dr. Goldengorin has published four monographs, three textbooks, and
an editor of five books on mathematical programming, game theory, combinatorial
optimization, network analysis algorithms, graph theory, and big data analysis.
He is an associate editor of Journal of Global Optimization, Journal of Combina-
torial Optimization, SN Operations Research Forum, and member of the Editorial
Board of the Journal of Computational and Applied Mathematics of the Taras
Shevchenko National University of Kyiv, Ukraine.

Sergei Kuznetsov graduated from the Faculty of Applied Mathematics and Control
of the Moscow Institute for Physics and Technology in 1985. He obtained his
Doctor of Science Degree in Theoretical Computer Science at the Computing Center
of Russian Academy of Science. Since 2006, he is the Head of Department for
Data Analysis and Artificial Intelligence, Head of the International Laboratory
for Intelligent Systems and Structural Analysis, and Academic Supervisor of the
Data Science master program at National Research University Higher School of
Economics (Moscow). His research interests are in the algorithms of data mining,
knowledge discovery, and formal concept analysis.

xxxv
Optimal Layered Defense For Site
Protection

Tsvetan Asamov, Emre Yamangil, Endre Boros, Paul B. Kantor,

and Fred Roberts

We are pleased to dedicate this paper to our friend and

colleague Boris Mirkin on the occasion of his 80th birthday.

1 Introduction

We study the problem of defending a target such as a stadium or a large gathering

place with multiple access paths. In practice, the notion of “layered defense” is
commonly used to describe the idea that we have an outer perimeter where we
first seek to capture dangerous entities (vehicles, people, cargo), then perhaps a
middle perimeter or perimeters where we do the same thing using different methods
and perhaps information gathered from the outer perimeter, and then an inner
perimeter where we again use different methods and information gathered from
earlier perimeter defense. Thus, as vehicles approach a stadium, we might do license
plate reading; in a middle layer or layers we use radiation detectors or behavioral
detection of patrons after they have parked their cars; then in an inner layer we
use metal detection through wanding or walkthrough magnetometers or where we
inspect bags or pat-down patrons. We seek to make this idea of layered defense
precise in an abstract, simplified way.
We speak abstractly of “sensors” at each layer of defense, but understand that
our “sensors” could be physical sensors but also tests of different kinds such as
behavioral observation. Our approach is based in an increasing literature that deals
with inspection processes using a number of potential tests, for example at ports of
entry. In the past few years numerous techniques for sensor optimization of port-of-
entry inspection have been explored in the literature [1, 5–9, 12–14, 16, 17]. Several

T. Asamov · E. Yamangil · E. Boros

Rutgers Center for Operations Research, New Brunswick, NJ, USA
P. B. Kantor · F. Roberts ()
CCICADA Center, Rutgers University, New Brunswick, NJ, USA
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 1

B. Goldengorin, S. Kuznetsov (eds.), Data Analysis and Optimization, Springer
Optimization and Its Applications 202,
https://doi.org/10.1007/978-3-031-31654-8_1
2 T. Asamov et al.

authors have reported numerical results that demonstrate significant improvement

over straightforward inspection approaches [1, 6, 7, 12, 13, 16]. In line with existing
practices, most researchers have assumed that the vast majority of the inspected
items and people are perfectly legal and only a very small proportion of the incoming
flow is harmful. Under such circumstances, the sensor operating cost (though not the
capital cost) is usually only a small fraction of the overall cost of the inspection
operation. The bulk of the total cost and time spent is attributed to a thorough
inspection procedure that is performed on potential suspicious items and individuals.
Such a situation is usually encountered in airport security checkpoints, border
crossings, maritime port inspection stations, large sports stadiums, etc.
In mathematical terms, the problem of layered site security is quite different from
optimizing a set of sensors searching for illegal cargo at a port-of-entry. Unlike
much of the existing inspection optimization work which considers two distinct
populations of inspected items, i.e. legal and illegal, in our model we only consider
the latter. We work under the assumption that we can incorporate the processing
cost of occasional encounters of legal traffic into the overall cost curve for detecting
contraband.

2 Mathematical Model

To develop our ideas, we have formulated a model of a perimeter defense of the

target with two layers of defense where we have a limited budget for surveillance
and we need to decide how much to invest in each layer and where to invest it if
there are several locations where we might do inspection in each layer. Defense at
the outer layers might be less successful but could provide useful information to
selectively refine and adapt strategies at inner layers. Arranging defense in layers
so that decisions can be made sequentially might significantly reduce costs and
increase chance of success. Monitoring at an outer layer could not only hinder an
attacker but could provide information about the current state of threat that could
be used to refine and adapt strategies at inner layers. There is a complex tradeoff
between maximizing the cost-effectiveness of each layer and overall benefits from
devoting some efforts at the outer layer to gathering as much information as possible
to maximize effectiveness of the inner layer.
To give a stylized abstract version of what we have in mind, consider Fig. 1,
where we show a target in the middle, threats arrive via two inner channels and each
is reachable from two outer flows of vehicles, patrons, etc.

Fig. 1 An abstract model of

layered defense showing a
target in the middle, threats
arriving via two inner
channels, and each reachable
from two outer flows of
vehicles, patrons, etc.
Optimal Layered Defense For Site Protection 3

We concentrate in this paper on the problem with two layers of defense, where
each security layer has a number of sensors placed on possible paths of incoming
illegal flow of vehicles and/or patrons. Inner layers are composed of sensors that are
used to detect units that have managed to infiltrate outer layers undetected. Every
interior sensor is connected to one or more sensors in the immediately preceding
outer layer in the sense that it is responsible for backing up those sensors, i.e., the
goal of the interior sensor is to discover traffic which has remained undetected by
those outer sensors. In order for an illegal unit to penetrate the system, it would
need to remain unnoticed at all layers of inspection. We denote the set of sensors
in the internal layer of defense with I , and the set of sensors in the external layer
with J . We assume that sensors in the set I share a limited total resource budget X
and sensors in the set J share a limited total budget Y . More subtle models allow
one to make decisions about how much budget to allocate between the inside and
outside layers. Our objective is to develop optimization methods to determine the
optimal allocation of resources to security sensors in such a manner that the expected
detection rate of incoming threats is maximized. For modeling purposes we employ
the following assumptions:
• There exists only one type of violation that we are protecting against.
• The expected number of contraband units on each incoming path is a known
parameter. For the outermost perimeter sensor j , we denote the incoming
contraband flow with .Fj .
• For each sensor .i ∈ I , located at the inside security layer, we know the function
x
.D (x), which specifies the detection rate at the sensor for contraband items if
i
the total amount of resources made available to the sensor is x, and similarly
y
for each .j ∈ J we know the detection function .Dj (y). In this paper we assume
that the detection functions are specified as concave increasing piecewise linear
functions. Thus, we do not require the detection functions to be differentiable
everywhere, which is an important property of our method. We also assume that
the resources are normalized to take values between 0 and 1.
• All sensors at a given layer share a limited common resource. For example,
an outside perimeter could be supported by a fixed number of infrared motion
detectors or license plate readers, while an inside perimeter could consist of
walkthrough magnetometer tests or security guards conducting wanding of
patrons.
Our goal is to allocate the total outside resources among individual sensors and
allocate the total inside resources among individual sensors in order to maximize
the detected illegal flow. Thus we arrive at the following mathematical formulation:
4 T. Asamov et al.

⎧⎛ ⎞ ⎛ ⎞⎫
⎨ ⎬
max ⎝ Fj · Dj (yj )⎠ + Dix (xi ) ⎝
y
Fj (1 − Dj (yj ))⎠
y
x,y ⎩ ⎭
i∈I j ∈N (i) j ∈N (i)

s.t. xi ≤ X
i∈I
.
(1)
yj ≤ Y
j ∈J

xi ≥ 0, ∀i ∈ I
yj ≥ 0, ∀j ∈ J

where .N (i) denotes the set of outside sensors adjacent to inside sensor i. Here, the
first sum over the outside neighbors j of i gives the flow that is captured at j and
the second sum gives the flow that is not captured at j but is captured at i.
Now, let us examine the given objective function. Clearly, it contains mixed
nonlinear product terms of detection probabilities. Moreover, since there are no pure
quadratic terms, we know that in general the objective function is neither convex,
nor concave. We illustrate this in Example 2.1.
Example 2.1 (Indefinite Objective Function) Suppose we have a single exterior
sensor j preceding a single interior sensor i, as shown in Fig. 2. In this case, we
would need to solve the following problem.

y y
max Fj · Dj (yj ) + Dix (xi ) · (Fj · (1 − Dj (yj )))
xi ,yj
(2)
s.t. 0 ≤ xi ≤ X
.

0 ≤ yj ≤ Y

y
Let us for a moment consider what would happen if .Dix (x) = x and .Dj (y) = y as
y
shown in Fig. 3. In that case, .Dix and .Dj are differentiable everywhere. Thus, if we
denote the objective function in problem (2) as
y y
.f i,j (xi , yj ) = Fj · Dj (yj ) + Dix (xi ) · (Fj · (1 − Dj (yj )))

then we know

Fig. 2 A model of layered defense showing a target with a single exterior sensor preceding a
single interior sensor
Optimal Layered Defense For Site Protection 5

Fig. 3 Linear detection rates at both the exterior and interior sensors of Fig. 2

⎡ ⎤
∂f i,j (xi ,yj )
∇f i,j (xi , yj ) = ⎣ ∂xi
∂f i,j (xi ,yj )
⎦
∂yj

y
Fj (1 − Dj (yj ))
=
.
Fj (1 − Dix (xi )) (3)

Fj (1 − yj )
=
Fj (1 − xi )

0
≥
0

Therefore the objective function (Fig. 4) is increasing everywhere in the feasible

region. Thus, we know that we would get an optimal solution to problem (2) by
setting .xi = X and .yj = Y . However, upon further inspection we can notice that if
we attempted to solve the problem as a convex optimization problem, we would run
into difficulties. The Hessian of the objective function has the following form:

0 −Fj
. ∇ f
2 i,j
(xi , yj ) = (4)
−Fj 0

And its two eigenvalues are .λ1 = Fj and .λ2 = −Fj . Hence we know that the
Hessian matrix associated with the quadratic terms is indefinite.
The indefiniteness of the Hessian presents a major obstacle to solving the
problem with standard solvers for quadratic programming. In our study, we tried
solving numerous instances using different methods implemented in the MATLAB
optimization toolbox. While in some cases we were able to produce consistent
output, none of the examined methods were able to overcome the indefiniteness
of the Hessian matrix for all possible values of the input parameters. This created
the need for the development of an alternative solution method for the problem.
6 T. Asamov et al.

y
Fig. 4 A plot of the objective function of (2) for the case of .Dix (xi ) = xi , Dj (yj ) = yj and
.Fj = 1. In this case three of the corners of the feasible region are optimal solutions

3 Exhaustive Search Methods

A standard approach to such problems is a brute force approach that fixes a

resource partition mesh and enumerates all possibilities. This exhaustive search
approach would be to discretize the resource space for each sensor into a number
of subintervals. Then we could examine every possible resource allocation scenario
and among all feasible cases select the one that maximizes the objective function
value. However, this method would be computationally intractable even for trivial
cases. For example, suppose that we have four inside sensors, and each of them
is related to exactly two outside sensors. Further, suppose we split the parameter
search space of each sensor into one hundred discrete intervals. Then we would
need to evaluate the objective function a total of .1004+8 = 1024 times, which is
clearly unacceptable unless a very large cluster is used. Moreover, if we considered
a slightly larger case of fifteen interior sensors, each supporting a couple of outside
perimeter sensors, then the number of cases explodes to .10015+30 = 1090 , which
exceeds the current estimates for the number of atoms in the universe.
Optimal Layered Defense For Site Protection 7

However, it is sufficient to discretize the parameter space for the interior sensors.
Then, for each fixed set of values, we can find the optimal configuration of
the exterior perimeter by solving a linear programming problem. If we take this
approach, then the two above mentioned instances require, respectively, the solution
of .108 and .1030 small linear programming problems. While this is a significant
improvement, we are still subjected to the curse of dimensionality as the number
of sensors in the interior perimeter increases. Fortunately, we can overcome this
challenge.

4 Dynamic Programming Method

To illustrate the idea behind our method, we consider the following basic example.
y
Suppose we would like to solve problem (2) for the case when .Dix and .Dj are
general piecewise linear functions, as illustrated in Fig. 5.
In that case, the objective function .f i,j (xi , yj ) of problem (2) is still continuous.
Further, .f i,j (xi , yj ) is differentiable everywhere except at points corresponding to
y
corner points of the detection functions .Dix and .Dj . Moreover, at points where
.f
i,j (x , y ) is differentiable, its gradient has the form
i j

∂f (xi ,yj )
∂xi
∇f i,j
(xi , yj ) = ∂f (xi ,yj )
∂yj

y
ci Fj (1 − Dj (yj )) (5)
.
=
cj Fj (1 − Dix (xi ))

0
≥
0

Fig. 5 Piecewise linear detection rates at both the exterior and interior sensors of Fig. 2
8 T. Asamov et al.

for some constants .ci , cj ≥ 0. Thus, the optimal solution is again obtained by setting
xi = X and .yj = Y .
.

Our solution method involves four main steps.

Step 1
For a fixed . > 0, we create a partition .Y = {0, , 2, . . . , Y } of the interval .[0, Y ],
as well as a partition .X = {0, , 2, . . . , X} of the interval .[0, X].
Step 2
For every pair of sensors .i, j such that .i ∈ I and .j ∈ N(i), we compute .T i,j (Xi , Yj )
for each .(Xi , Yj ) ∈ X × Y, where we use .T i,j (Xi , Yj ) to denote the maximum
amount of detected illegal contraband when inner sensor i uses at most .Xi inner
resources, and outer sensor j uses at most .Yj outer resources. Formally, we need to
compute the optimal value of the following optimization problem:
⎧ ⎫
y y
⎪
⎪ max Fj · Dj (yj ) + Dix (xi ) · (Fj · (1 − Dj (yj ))) ⎪
⎪
⎪
⎨ xi ,yj ⎪
⎬
T i,j (Xi , Yj ) = (6)
s.t. 0 ≤ xi ≤ Xi
.
⎪
⎪ ⎪
⎪
⎪
⎩ ⎪
⎭
0 ≤ yj ≤ Yj

We can solve an instance of problem (6) in .O(1) time by setting .xi = Xi and
.yj = Yj . Thus, we can compute .T i,j (Xi , Yj ) for each .(Xi , Yj ) ∈ X × Y in
.O(|X ||Y|). Hence, we can plot the objective function in (2) with arbitrary precision

which would ultimately allow us to solve problem (1) with arbitrary precision (see
Fig. 6).
Step 3
Now, suppose that instead of a single outside sensor j , we consider two outside
sensors .j1 and .j2 that are both backed up by inner sensor i (see Fig. 7). We use
.{j1 , j2 } to denote quantities that refer to the combined system with two outside

sensors .j1 and .j2 . For example, .T i,{j1 ,j2 } denotes a table of optimal detection values
for the combined system of inside sensor i and two outside sensors .j1 and .j2 .
Further, we also use .Xi,{j1 ,j2 } and .Y i,{j1 ,j2 } to denote the amount of inside and
outside resources budgeted to the sensor system of inner sensor i, and outer sensors
.j1 and .j2 . Please recall that in Step 2, we computed the two tables of optimal

values .T i,j1 and .T i,j2 (considering first the problem with only sensors .i, j1 and
then the problem with only sensors .i, j2 (see Fig. 8)). Since they both share the
same inner sensor, all we need to do in order to find the optimal detection value
.T
i,{j1 ,j2 } (X i,{j1 ,j2 } , Y i,{j1 ,j2 } ) for the combined system is to determine the optimal

way to allocate .Y i,{j1 ,j2 } between sensor .j1 and .j2 . Thus we have the problem of
optimizing the following formulation:
Optimal Layered Defense For Site Protection 9

y
Fig. 6 A plot of the objective function of (2) for piecewise linear functions .Dix (xi ) and .Dj (yj )

.T i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 } =
⎧ ⎫
⎪
⎪ max T i,j1 Xi,{j1 ,j2 } , y i,j1 + T i,j2 Xi,{j1 ,j2 } , y i,j2 ⎪
⎪
⎪
⎪ y i,j1 ,y i,j2 ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ s.t. ⎪
⎪
⎨ ⎬
= y i,j1 + y i,j2 ≤ Y i,{j1 ,j2 } (7)
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ y i,j1
∈ Y ⎪
⎪
⎪
⎪ ⎪
⎪
⎩ i,j2 ⎭
y ∈Y

i,j i,j
Notice that even though we consider all different values of .yj 1 , yj 2 ∈ Y,
problem (7) can be solved in time linear in the cardinality of .Y. This is accomplished
by using two index variables initialized at the two ending points of the outside
resource partition .Y.
If we have three outside sensors .j1 , j2 , j3 corresponding to inside sensor i, then
we can find their solution matrix .T i,{j1 ,j2 ,j3 } as follows. Once we have computed
10 T. Asamov et al.

Fig. 7 A network with two

outside sensors (green and
blue) and one inside sensor
backing them both up

Fig. 8 Finding separate solutions for inner sensor i with each outside sensor .j1 and .j2

the matrix .T i,{j1 ,j2 } we use it as an input to Eq. (7), together with the matrix .T i,j3
to generate the solution matrix .T i,{j1 ,j2 ,j3 } . By recursion, we can solve a problem
instance that involves one inner sensor i and any number of outside sensors. We
denote with .T i the solution table of optimal values corresponding to inner sensor i
together with all of its adjacent outside sensors .j ∈ N (i).
Step 4
Suppose we are given two matrices, .T i1 and .T i2 , that correspond respectively to two
inner sensors .i1 and .i2 with their adjacent outside sensors. In order to determine the
optimal detection value .T {i1 ,i2 } (X{i1 ,i2 } , Y {i1 ,i2 } ) for the combined system, we need
{i ,i }
to find the optimal way to allocate .Xi 1 2 between .T i1 and .T i2 . Thus we consider
the problem of optimizing the following,
Optimal Layered Defense For Site Protection 11

⎧ ⎫
⎪ max T i1 x i1 , y i1 + T i2 x i2 , y i2 ⎪
⎪
⎪ ⎪
⎪
⎪
⎪
x i1 ,x i2 ,y i1 ,y i2 ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪
s.t. ⎪
⎪
⎪
⎪ ⎪
⎪
⎨ i1 {i1 ,i2 } ⎬
{i1 ,i2 } {i1 ,i2 } {i1 ,i2 } x +x ≤X i2
.T X ,Y = (8)
⎪
⎪ ⎪
⎪
⎪
⎪
⎪ y i1 + y i2 ≤ Y {i1 ,i2 } ⎪
⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ x i1
, x i2
∈ X ⎪
⎪
⎪
⎪ ⎪
⎪
⎩ i1 i2 ⎭
y ,y ∈ Y

We point out that problem (8) can be solved in .O(|X ||Y|) time. Finally, once we
have computed .T {i1 ,i2 } , we can proceed by recursion to solve problems involving an
arbitrary number of interior and exterior sensors.

5 Running Time

We consider the running times of all of the four steps.

Step 1: Creating the partitions takes .O(|X | + |Y|).
Step 2: For every pair .i, j such that .i ∈ I, j ∈ N(i), we have to compute a matrix
in .O(|X ||Y|) time. Thus, step 2 takes .O(|X ||Y||I ||J |).
Step 3: For every .i ∈ I we perform step 3 .|N(i)| times, and every time we
need to compute .|X ||Y| number of entries, each taking .O(|Y|). Thus the overall
complexity of step 3 is .O(|X ||Y|2 |I ||J |).
Step 4: We need to execute this step .|I | − 1 times. Each of .|X ||Y| entries in
the resulting matrix takes .O(|X ||Y|) time to compute. Thus, the computational
complexity of step 4 is .O((|X ||Y|)2 |I |).
Since the four steps are performed sequentially, we know the overall running
time of the dynamic programming method is .O(|X ||Y|2 |I |(|J | + |X |)).

6 Convergence

So far, we have only considered discrete approximations of the optimal detection

values. In this section we show that as the partition mesh . → 0, the values of the
discrete approximation tables T converge to the true continuous optimal detection
values t.
Since, the discrete approximation is exact for the case of one inside and one
outside sensor, we know

t i,j (Xi , Yj ) = T i,j (Xi , Yj ), ∀(Xi , Yj ) ∈ X × Y

.
12 T. Asamov et al.

Consider the objective function .f i,j (·, ·) of the system consisting of inner
sensor i and its external neighbor j . Since .f i,j is continuous, as well as quadratic
everywhere except for a set of measure zero, we know that .f i,j is Lipschitz
continuous and we denote its Lipschitz constant with .Li,j . If we choose .L ∈ R
such that

L = max Li,j
.
i∈I
j ∈N (i)

then L is a Lipschitz constant for all functions .f i,j .

Suppose .i ∈ I and .j1 , j2 ∈ N(i), .j1 = j2 . We use .ξ i,{j1 ,j2 } to denote the error
between the true optimal detection value .t i,{j1 ,j2 } , and the discrete approximation
.T
i,{j1 ,j2 } . We would use lower case .x i,j1 and .x i,j2 to denote the optimal way to split

up the inside resources .Xi,{j1 ,j2 } between .t i,j1 and .t i,j2 , while .y i,j1 and .y i,j2 denote
the optimal way to split up the outside resources .Y i,{j1 ,j2 } between .t i,j1 and .t i,j2 .
On the other hand, we would use .Xi,j1 and .Xi,j2 to denote the optimal way to
split up the inside resources .Xi,{j1 ,j2 } between .T i,j1 and .T i,j2 , while .Y i,j1 and .Y i,j2
denote the optimal way to split up the outside resources .Y i,{j1 ,j2 } between .T i,j1 and
.T
i,j2 . Before we proceed, we need to introduce the following notation. We use . x

to denote the point in .X that is closest to x from below, and we use . y to denote
the point in .Y that is closest to y from below. Similarly, we use . x to denote the
point in .X that is closest to x from above, and we use . y to denote the point in .Y
that is closest to y from above. Then the discrete approximation error can be written
as,

i,{j1 ,j2 }
. ξ Xi,{j1 ,j2 } , Y i,{j1 ,j2 }

= t i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 } − T i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 }

= t i,j1 x i,j1 , y i,j1 + t i,j2 x i,j2 , y i,j2

−T i,j1 Xi,j1 , Y i,j1 − T i,j2 Xi,j2 , Y i,j2

≤ t i,j1 x i,j1 , y i,j1 + t i,j2 x i,j2 , y i,j2

−T i,j1 x i,j1 , y i,j1 − T i,j2 x i,j2 , y i,j2

= t i,j1 x i,j1 , y i,j1 + t i,j2 x i,j2 , y i,j2

−t i,j1 x i,j1 , y i,j1 − t i,j2 x i,j2 , y i,j2

= t i,j1 x i,j1 , y i,j1 − t i,j1 x i,j1 , y i,j1

+ t i,j2 x i,j2 , y i,j2 − t i,j2 x i,j2 , y i,j2 (9)
Optimal Layered Defense For Site Protection 13

√
Since we have . t i,jk x i,jk , y i,jk − t i,jk x i,jk , y i,jk ≤ 2L for both .k =
1, 2 the bound
√
i,{j1 ,j2 }
.ξ Xi,{j1 ,j2 } , Y i,{j1 ,j2 } ≤ 2 2L (10)

follows. Now, we can also bound the error in the case of three outside sensors:

ξ i,{j1 ,j2 ,j3 } Xi,{j1 ,j2 ,j3 } , Y i,{j1 ,j2 ,j3 } =

= t i,{j1 ,j2 ,j3 } Xi,{j1 ,j2 ,j3 } , Y i,{j1 ,j2 ,j3 }

− T i,{j1 ,j2 ,j3 } Xi,{j1 ,j2 ,j3 } , Y i,{j1 ,j2 ,j3 }

= t i,{j1 ,j2 } x i,{j1 ,j2 } , x i,{j1 ,j2 } + t i,j3 x i,j3 , y i,j3

− T i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 }

− T i,j3 Xi,j3 , Y i,j3

≤ t i,{j1 ,j2 } x i,{j1 ,j2 } , y i,{j1 ,j2 } + t i,j3 x i,j3 , y i,j3

− T i,{j1 ,j2 } x i,{j1 ,j2 } , y i,{j1 ,j2 }
.
(11)
− T i,j3 x i,j3 , y i,j3

≤ t i,{j1 ,j2 } x i,{j1 ,j2 } , y i,{j1 ,j2 } + t i,j3 x i,j3 , y i,j3

− t i,{j1 ,j2 } x i,{j1 ,j2 } , y i,{j1 ,j2 }
√
− t i,j3 x i,j3 , y i,j3 + 2 2L

= t i,j1 x i,j1 , y i,j1 − t i,j1 x i,j1 , y i,j1

+ t i,j2 x i,j2 , y i,j2 − t i,j2 x i,j2 , y i,j2
√
+ 2 2L
√ √ √
≤ 2L + 2L + 2 2L
√
= 4 2L

Proceeding by induction, we know that

14 T. Asamov et al.

√
ξ i ≤ 2 2(|J | − 1)L, ∀i ∈ I
.

since an inside sensor can have at most .|J | adjacent outside sensors.
Now, suppose that .i1 , i2 ∈ I, i1 = i2 . Then,

ξ {i1 ,i2 } X{i1 ,i2 } , Y {i1 ,i2 }

= t {i1 ,i2 } X{i1 ,i2 } , Y {i1 ,i2 } − T {i1 ,i2 } X{i1 ,i2 } , Y {i1 ,i2 }

= t i1 x i1 , y i1 + t i2 x i2 , y i2 − T i1 Xi1 , Y i1 − T i2 Xi2 , Y i2

≤ t i1 x i1 , y i1 + t i2 x i2 , y i2 − T i1 x i1 , y i1 − T i2 x i2 , y i2

≤ t i1 x i1 , y i1 + t i2 x i2 , y i2

− T i1 x i1 , y i1 − T i2 x i2 , y i2
. (12)

= t i1 i1
x , y i1
−T i1 i1
x , y i1

+ t i2 x i2 , y i2 − T i2 x i2 , y i2

+ t i1 x i1 , y i1 − t i1 x i1 , y i1

+ t i2 x i2 , y i2 − t i2 x i2 , y i2
√ √ √
≤ 2 2 (|J | − 1) L + 2 2 (|J | − 1) L + 2 2L
√
≤ 4 2 (|J |) L

We can also bound the error in the case of three inside sensors and all of their
adjacent outside sensors:

. ξ {i1 ,i2 ,i3 } (X{i1 ,i2 ,i3 } , Y {i1 ,i2 ,i3 } )

= t {i1 ,i2 ,i3 } X{i1 ,i2 ,i3 } , Y {i1 ,i2 ,i3 } − T {i1 ,i2 ,i3 } X{i1 ,i2 ,i3 } , Y {i1 ,i2 ,i3 }

= t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } + t i3 x i3 , y i3

−T {i1 ,i2 } X{i1 ,i2 } , Y {i1 ,i2 } − T i3 Xi3 , Y i3

≤ t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } + t i3 x i3 , y i3

−T {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } − T i3 x i3 , y i3
Optimal Layered Defense For Site Protection 15

≤ t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } + t i3 x i3 , y i3

−T {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } − T i3 x i3 , y i3

= t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } − T {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 }

+ t i3 x i3 , y i3 − T i3 ( x i3 , y i3 )

+ t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 } −t {i1 ,i2 } x {i1 ,i2 } , y {i1 ,i2 }

+ t i3 x i3 , y i3 −t i3 ( x i3 , y i3 )
√ √ √
≤ 4 2|J |L + 2 2(|J | − 1)L + 2 2L
√
≤ 6 2(|J |)L (13)

Proceeding by induction, we know that .ξ I the error of the discrete approximation

for the entire set of internal sensors I and all of their adjacent outside sensors is
bounded by
√
ξ I ≤ 2 2|I ||J |L
.

Therefore,

lim ξ I = t I (XI , Y I ) − T I (XI , Y I )

→0
√
. ≤ lim 2 2|I ||J | (14)
→0

Hence, as . → 0 the discrete approximation .T I (XI , Y I ) converges to the true

continuous optimal detection value .t I (XI , Y I ).

7 The Case of an Adaptive Adversary

So far, our model assumed a fixed flow of dangerous material on each pathway,
and we have presented a method that would allow law enforcement officials to
use current information on attacker behavior to maximize the amount of captured
illegal or dangerous contraband. However, we can think of attackers as intelligent
adversaries who would adjust their strategy once they observe the changes in site
security. Therefore, the goal of a defensive strategy could be to make sure that no
path leading into the site has a violation detection rate that is unreasonably low. For
16 T. Asamov et al.

example, suppose we have an adaptive adversary who recognizes how much of a

resource we use for sensors on each node and then chooses the path that minimizes
the probability of detection. To defend against such an adversary we might seek to
assign sensor resources so as to maximize the minimum detection rate on any path.
Hence we face the following optimization challenge:

y y
max min Dj (yj ) + Dix (xi )(1 − Dj (yj ))
x,y i∈I
j ∈N (i)

s.t. xi ≤ X
i∈I
. (15)
yj ≤ Y
j ∈J

xi ≥ 0, ∀i ∈ I
yj ≥ 0, ∀j ∈ J

In order to solve this problem we can use a similar approach to the one discussed in
the previous section.
Steps 1 and 2
These are identical to their counterparts described in Sect. 4, with .Fj = 1 for every
outside sensor j .
Step 3
Once again, we denote by .T i,j the resulting table of values generated at Step 2.
More specifically, we denote with .T i,j (Xi , Yj ) the optimal detection value that can
be achieved by investing .(Xi , Yj ) ∈ X ×Y resources of respectively, inner and outer
resources. Then we consider the case of two outside sensors .j1 , j2 that are backed
up by inner sensor i. We can merge .T i,j1 and .T i,j2 into a single solution according
to:

T i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 } =
⎧ ⎫
⎪
⎪ max min T i,j1 Xi,{j1 ,j2 } , y i,j1 , T i,j2 Xi,{j1 ,j2 } , y i,j2 ⎪⎪
⎪
⎪ ⎪
⎪
⎪
⎪ y i,j1 ,y i,j2 ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ s.t. ⎪
⎪
.
⎪
⎨ ⎪
⎬ (16)
= y i,j1 + y i,j2 ≤ Y i,{j1 ,j2 }
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪ i,j1
⎪ ⎪
⎪
⎪
⎪ y ∈ Y ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎩ i,j2 ⎪
⎭
y ∈Y

where .T i,{j1 ,j2 } Xi,{j1 ,j2 } , Y i,{j1 ,j2 } is the optimal detection value for the combined
system.
Optimal Layered Defense For Site Protection 17

Again, if we proceed by induction, we can generate an optimal value table for a

problem instance that involves one inner sensor i, and an arbitrary number of outside
sensors. We denote such a table by .T i .
Step 4
Consider two matrices .T i1 and .T i2 that correspond to respectively inner sensors
.i1 and .i2 with all of their adjacent outside sensors. We could again merge the two

solutions into a single global solution according to the following rule:

⎧ i i i ⎫
⎪
⎪ max min T i1 i1
x , y i1
, T 2 x 2, y 2 ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ x i1 ,x i2 ,y i1 ,y i2 ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ s.t. ⎪
⎪
⎪
⎪ ⎪
⎪
⎨ {i1 ,i2 } ⎬
{i1 ,i2 } {i1 ,i2 } {i1 ,i2 } x +x ≤X
i1 i2
.T X ,Y =
⎪
⎪ ⎪
⎪
⎪
⎪ y i1 + y i2 ≤ Y {i1 ,i2 } ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ ⎪
⎪
⎪
⎪ i1, x 2 ∈ X
i ⎪
⎪
⎪
⎪ x ⎪
⎪
⎪
⎪ ⎪
⎪
⎩ i1 i2 ⎭
y ,y ∈ Y
(17)

where .T {i1 ,i2 } X{i1 ,i2 } , Y {i1 ,i2 } denotes the optimal value for the combined system
that employs internal sensors .i1 and .i2 , and all of their outside neighbors.
Once again, if we have a third branch consisting of inside sensor .i3 and its outside
neighbors, then we can use .T {i1 ,i2 } and .T i3 as inputs to Eq. (17) and find the table
of optimal values .T {i1 ,i2 ,i3 } for the combined system consisting of inside sensors
.i1 , i2 , i3 , and all of their adjacent outside sensors. Proceeding by induction, we know

that even if we have an adaptive adversary, we can solve problems involving an

arbitrary number of interior and exterior sensors, as well as sensor detection curves
specified by concave increasing piecewise linear functions.

8 Computational Results

In this section we present computational results for the methods developed above.
The experiments were performed on an AMD Phenom X4 9550 workstation with
6GB of DDR2 RAM. We consider two different system configurations, and for each
of them we provide plots of the objective function value for both the original and
adaptive adversary models.
Example 8.1 In this example we consider an inner layer consisting of four sensors,
one with three adjacent outside sensors (indices 1, 2, 3), a second with two adjacent
outside sensors (indices 4, 5), a third with two adjacent outside sensors (indices 6, 7),
and a fourth with two adjacent outside sensors (indices 8, 9). For each inside sensor
.i ∈ I , we specify .D (x) = min{0.2x, 0.4 + 0.1x}. Further, for the first outside
x
i
18 T. Asamov et al.

Fig. 9 Solution maximizing the expected amount of captured contraband for a range of interior
and exterior budgets for Example 8.1

y
sensor we use .Dj1 (y) = min{0.3y, 0.3+0.1y, 0.5+0.05y}, and for outside sensors
y
of index .j = 2, 3, . . . , 9, we use .Dj (y) = min{0.3y, 0.3 + 0.1y}. In addition, all
outside sensors have exactly 1 unit of incoming flow. Figure 9 gives the solution
maximizing the expected amount of captured contraband for a range of interior and
exterior budgets, i.e., the solution to the first problem. The solution matrix includes
10,302 distinct points and the computation took 117 seconds.
We can also calculate the adaptive adversary solution maximizing the minimum
probability of capturing contraband along all paths for a range of interior and
exterior budgets. The solution shown in Fig. 10 includes 40,401 distinct points and
the computation took 3102 seconds (52 minutes).

Example 8.2 In this example, we modify Example 8.1 so that all the outside sensors
have exactly 1 unit of incoming flow except for outside sensors 1 and 9 which
have 10 units of incoming flow. Figure 11 shows the optimal objective values of
the maximized amount of captured contraband for a range of interior and exterior
budgets. The solution table includes 10,302 distinct points, and the computation
took 119 seconds.
In this example we only changed the flow values of Example 8.1. For this reason,
we do not need to compute a new adaptive adversary solution, as it would be
identical to the one for Example 8.1. Naturally, this illustrates the robustness of
the adaptive adversary formulation compared to its original counterpart.
Optimal Layered Defense For Site Protection 19

Fig. 10 Adaptive adversary solution maximizing the minimum probability of capturing contra-
band along all paths for a range of interior and exterior budgets for Example 8.1

Fig. 11 Solution maximizing the expected amount of captured contraband for a range of interior
and exterior budgets for Example 8.2

9 Closing Remarks

We have considered the problem of determining the optimal resource allocation for
layered security. A computational method for the maximization of captured contra-
20 T. Asamov et al.

band and an adaptive adversary approach for the maximization of the worst case
probability of detection have been developed. Both methods are computationally
tractable and can be applied to non-trivial practical problems.
We have a great deal more that we can do in the future. One thing is to consider
both legal and illegal flow, which we also refer to as respectively good and bad
units. Hence, in addition to detecting bad units we could consider false positive
decisions for each sensor and adopt a risk-averse optimization approach [3]. Another
possible direction would be attempting to write the problem as a large game and use
approximation methods similar to the ones developed by Grigoriadis and Khachian
[10]. Alternatively, we could look into interdiction on planar graphs methods similar
to the ones developed by Zenklusen [18, 19].
Still another approach is to follow the applications of Stackelberg games that
have been used in pioneering defensive approaches at the nation’s airports, ports,
and in applications by the Federal Air Marshals Service, US Coast Guard, etc. (see
[11, 15]). In a Stackelberg game between an attacker and a defender, the defender
(security) acts first. The attacker can observe the defender’s strategy and choose
the most beneficial point of attack. The challenge is to introduce some randomness
in the defender’s strategy to increase the uncertainty on the part of the attacker.
Bayesian Stackelberg games do exactly that. Layered defense makes this into a new
kind of Stackelberg game to analyze, one with two rounds, one involving the outer
layer and one involving the inner layer based on results at the outer layer. We can
look both at nonrandomized and randomized strategies for the defender.
There are many other directions in which this work could go. Even with our
current model, we have not yet developed practical methods to handle more than
two layers of defense. There are also many variations on our model that could
be quite interesting. For example, we could consider a fixed resource limit that
the defender could allocate between inner and outer layers. Then, we could allow
adaptive redistribution of resources across layers and across time (see [2, 4]).

Disclaimer The views and conclusions contained in this document are those of the authors and
should not be interpreted as necessarily representing the official policies, either expressed or
implied, of the U.S. Department of Homeland Security.

Acknowledgments This material is based upon work supported by the U.S. Department of
Homeland Security under Grant Award Numbers 22STESE000010101 from DHS Office of
University Programs via Northeastern University, DHS 2009- ST-061-CCI002 from DHS Office
of University Programs to Rutgers University, and HSHQDC-13-X-00069 from DNDO. PBK
acknowledges support from the NSF Grants Numbered 1247696 and 2041428, and the Department
of Industrial and Systems Engineering, University of Wisconsin, Madison.

References

1. Anand, S., Madigan, D., Mammone, R., Pathak, S., Roberts, F.: Experimental analysis of
sequential decision making algorithms for port of entry inspection procedures. In: Intelligence
and Security Informatics, pp. 319–330. Springer (2006)
Optimal Layered Defense For Site Protection 21

2. Asamov, T., Powell, W.B.: Regularized decomposition of high-dimensional multistage stochas-

tic programs with Markov uncertainty. SIAM J. Optim. 28(1), 575–595 (2018)
3. Asamov, T., Ruszczyński, A.: Time-consistent approximations of risk-averse multistage
stochastic optimization problems. Math. Program. 153(2), 459–493 (2015)
4. Asamov, T., Salas, D.F., Powell, W.B.: SDDP vs. ADP: the effect of dimensionality in
multistage stochastic optimization for grid level energy. storage. Preprint. arXiv:1605.01521
(2016)
5. Boros, E., Elsayed, E.A., Kantor, P.D., Roberts, F., Xie, M.: Optimization problems for port-
of-entry detection systems. In: Intelligence and Security Informatics, pp. 319–335. Springer
(2008)
6. Boros, E., Fedzhora, L., Kantor, P.B., Saeger, K., Stroud, P.: A large-scale linear programming
model for finding optimal container inspection strategies. Naval Res. Logist. 56(5), 404–420
(2009)
7. Boros, E., Goldberg, N., Kantor, P.B., Word, J.: Optimal sequential inspection policies. Ann.
Oper. Res. 187(1), 89–119 (2011)
8. Carpenter, T., Cheng, J., Roberts, F., Xie, M.: Sensor management problems of nuclear
detection. In: Safety and Risk Modeling and Its Applications, pp. 299–323. Springer (2011)
9. Elsayed, E.A., Young, C.M., Xie, M., Zhang, H., Zhu, Y.: Port-of-entry inspection: sensor
deployment policy optimization. IEEE Trans. Autom. Sci. Eng. 6(2), 265–276 (2009)
10. Grigoriadis, M.D., Khachiyan, L.G., Porkolab, L., Villavicencio, J.: Approximate max-min
resource sharing for structured concave optimization. SIAM J. Optim. 11(4), 1081–1091
(2001)
11. Jain, M., Tsai, J., Pita, J., Kiekintveld, C., Rathi, S., Tambe, M., Ordóñez, F.: Software
assistants for randomized patrol planning for the LAX airport police and the federal air marshal
service. Interfaces 40(4), 267–290 (2010)
12. Madigan, D., Mittal, S., Roberts, F.: Sequential decision making algorithms for port of entry
inspection: Overcoming computational challenges. In: Intelligence and Security Informatics,
2007 IEEE, pp. 1–7. IEEE (2007)
13. Madigan, D., Mittal, S., Roberts, F.: Efficient sequential decision-making algorithms for
container inspection operations. Naval Res. Logist. (NRL) 58(7), 637–654 (2011)
14. Pita, J., Jain, M., Marecki, J., Ordóñez, F., Portway, C., Tambe, M., Western, C., Paruchuri,
P., Kraus, S.: Deployed ARMOR protection: the application of a game theoretic model for
security at the Los Angeles International Airport. In: Proceedings of the 7th International Joint
Conference on Autonomous Agents and Multiagent Systems: Industrial Track, pp. 125–132.
International Foundation for Autonomous Agents and Multiagent Systems (2008)
15. Sandler, T., Arce, D.G.: Terrorism: a game-theoretic approach. In: Handbook of Defense
Economics, vol. 2, pp. 775–813 (2007)
16. Stroud, P.D., Saeger, K.J.: Enumeration of increasing boolean expressions and alternative
digraph implementations for diagnostic applications. In: Proceedings, vol. 4, pp. 328–333
(2003)
17. Young, C.M., Li, M., Zhu, Y., Xie, M., Elsayed, E.A., Asamov, T.: Multiobjective optimization
of a port-of-entry inspection policy. IEEE Trans. Autom. Sci. Eng. 7(2), 392–400 (2010)
18. Zenklusen, R.: Extensions to network flow interdiction on planar graphs. Preprint.
arXiv:0801.1737 (2008)
19. Zenklusen, R.: Network flow interdiction on planar graphs. Discrete Appl. Math. 158(13),
1441–1455 (2010)
Developing and Running a Set
of Competitive College/University
Blended Courses

Alexander S. Belenky

1 Introduction

Distance learning is one of the rapidly growing directions in higher education, and,
according to numerous reports, only in the U.S., it currently embraces millions of
students [1]. Higher education specialists continue to discuss various philosophical
and pedagogical problems of distance learning, including those related to the quality
of education that it may provide (compared to a traditional off-line higher education)
[2]. At the same time, the distance learning economics becomes one of the areas
which starts to concern administrations of all the colleges and universities at which
the courses available via the Internet are or are planned to be in use [3].
One of the topics that is actively discussed in scientific publications in the frame-
work of the economics of distance learning deals with specifics of the so-called
blended courses [4–7]. A group of such courses is formed by those incorporating
recorded fragments from particular courses available on the Internet into lectures
and seminars that are taught by college/university professors in classrooms offline.
The available data on studies analyzing the effectiveness of various forms of dis-
tance learning in general suggest that from the viewpoint of the quality of education,
blended courses can be not less effective than recorded ones [8, 9]. Moreover, the
so-called “peer effect”—which is inevitably present in all the blended courses—
apparently, positively affects this quality [10, 11]. However, running a blended
course on a particular subject may require the college/university administration to
provide more professors and teaching assistants than it would provide for an offline

A. S. Belenky ()
Department of Mathematics, Faculty of Economic Sciences and International Centre of Decision
Choice and Analysis, The National Research University Higher School of Economics, Moscow,
Russia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 23

course on the same subject. Yet, in most cases, such an expansion of the teaching
personnel can be done only within financial boundaries of the college/university
budget.
Meeting desirable quality requirements within certain financial restrictions is
a typical decision-making problem that is solved in industrial, transportation,
agricultural, business, financial, and other systems (See, for instance [12–18].)
These systems have long been using decision support tools letting their users find
the best allocation of those financial resources that they can afford to spend in order
to be competitive in corresponding markets. It seems that the administration of every
college/university would benefit from having similar tools at its disposal to meet the
challenges that the education market of potential new students currently poses and
will pose in the future. This is the case since all the colleges/universities compete,
particularly, for new students interested in good higher education.
One should draw attention of college/university administrations to the fact that
if properly advertised, an obligation to offer blended courses that are based on
recorded lectures and seminars of professors from the most famous universities in
the world is a factor positively affecting the student enrollment. Moreover, adopting
such a strategic decision of providing thus designed and advertised blended courses
is likely to affect positively the college/university position in the education market
in general, along with its budget, due to at least two reasons.
First, almost all the potential college/university new students who are interested
in acquiring knowledge (rather than only a diploma) dream on studying at Harvard,
MIT, Cambridge, Oxford, Princeton, Stanford and other world-famous universities.
Yet, many of these students are unable to study there due to either low high
school grades or financial reasons (or both). If a particular college/university could
convince potential new students of the above-mentioned kind that they would listen
to the same lecturers on most of the courses as do students from these famous
universities while being (a) offered effective tutorials prior to the start of every
course, and (b) explained all the course nuances in a simple manner understandable
to them, this would make a difference in making enrollment decisions by these
potential new students that are favorable to this college/university.
Second, if all such potential college/university new students were offered to study
both these courses and tutorials at an affordable cost—which may, eventually, be
lower than the cost of studying the same subjects off line at other colleges/universi-
ties—their intent to enroll at this college/university would become even stronger.
Thus, the question that a college/university administration—interested in running
blended courses there in principle—needs to answer is: How to make the above
education strategy a reality, and how to convince potential new students to enroll at
this college/university?
Two sets of problems are to be addressed by the interested college/university to
answer this question.
The first set includes problems associated with finding such a structure of each
particular blended course (which is based on the above-mentioned recordings) that
would help the college/university administration convince the potential new students
that from the very first lectures, they would become sure to succeed in studying the
Developing and Running a Set of Competitive College/University Blended Courses 25

course. (Certainly, the administration should make it clear that this success can come
only if the enrolled students (a) strictly follow recommendations of the teaching
personnel assigned to run this course, and (b) bend every effort to succeed.)
The second set includes problems associated with financial aspects of organizing
and running blended courses in a manner making it financially affordable to both
the college/university and the students.
Addressing problems from the first set (associated with choosing the structure of
each blended course) implies conducting comprehensive studies on how different
categories of students focus on both the subject of a lecture in general and particular
information or techniques discussed in the course. Also, the studies are to determine
the best way to run the lecture by choosing an optimal sequence of fragments from
the recorded courses and the explanations of the scope of these fragments that
should either precede a particular fragment or immediately follow it. These studies
should lead to designing testing questions to be asked by the teacher running the
course before going to the next fragment. However, all this should be done in a
manner that would not turn the lecture into a discussion with the audience that
may consume much of the lecture time. Finally, the structure of tutorials offered
by the college/university to let the students succeed in studying each blended course
designed in the chosen manner should be determined and announced in advance.
Published recommendations of the teachers, along with, possibly, even special
consultations of these teachers, can make a difference in the effectiveness of such
study results. Certainly, the experience and pedagogical skills of those to be chosen
to run blended courses matter a great deal. It’s especially so if every teacher chosen
to teach a blended course manages to test a particular version (or even different
particular versions) of this blended course that she/he proposes to run on groups of
the students similar (or at least close) to those who are expected to study the course
at the college/university.
Addressing problems from the second set implies the determination of
(a) the minimal budget to organize and run the teaching of a set of chosen blended
courses to secure such a percentage of the students (who are to study courses
from this set) expected to succeed in studying each particular course from the
set that would not be lower than a certain desirable one, and
(b) the maximal percentage of the students (who are to study blended courses from
this set of the chosen courses) expected to succeed in studying all the courses
from the set under a particular budget.
Though with respect to the first set of problems, there is a room (if not a
necessity) for the use of mathematical methods, these problems are currently
discussed in scientific publications on this subject mostly at the level of hypotheses
[19–21]. The authors of these publications either only set or set and verify the
proposed hypotheses by conducting polls among both the students and the teachers.
Problems from the second set are rarely considered in scientific publications at all,
and no helpful quantitative analysis of these problems have so far been offered.
The latter leaves college/university administrations “unarmed” in dealing with the
26 A. S. Belenky

economics of organizing and running blended courses or incorporating other forms

of distance learning into the curricula there.
The present paper focuses on those problems from the second set of the above-
mentioned ones that are associated with finding and analyzing financial options
available to a college/university administration interested in organizing and running
blended courses of the considered kind. Particularly, the paper aims at providing
a description of a decision-support tool for college/university administrations that
lets them quantitatively estimate the budget needed to design and run a set
of blended courses of any chosen structure. Proceeding from this budget, the
administration of an interested college/university can certainly calculate tuition
fees for those its potential new students who are to pay for their education in line
with the college/university financial policies being currently in force. (However, a
description of corresponding calculation schemes and methodologies implementing
these policies lies beyond the subject of the present paper.)
The proposed tool can help administrations of interested colleges/universities
develop an optimal strategy of organizing and running blended courses with the
use of available recorded materials (courses) from leading universities in the world
(as well as, certainly, with the use of those from any other universities). It’s
implemented by means of standard software packages that can particularly be run
on personal computers.
As far as the author is aware, the proposed decision-making tool is the first
one capable of solving particular economic problems that the administration of a
college/university interested in organizing and running blended courses faces in its
attempt to make the distance learning a part of the education process there.
Besides the Introduction, the paper includes three more sections. In Sect. 2, a
brief review of research publications in five important (from the author’s viewpoint)
areas associated with organizing the use of recorded courses in distance learning
in general and in running blended courses at colleges/universities in particular
is presented. In Sect. 3, a new mathematical model proposed by the author
for calculating an optimal strategy of hiring teachers to run a set of blended
courses with the use of recorded materials is described. Also, in this section, two
integer programming problems, formalizing those outlined in the Introduction, are
formulated on the basis of this model. Section 4 indicates, in particular, a set of
problems associated with running blended courses that could present interest for
college/university administrations and should be researched in the future.

2 A Brief Review of Research Publications in Five Major

Areas Related to the Use of Blended Courses in the
Education Process

Despite (a) the importance of using online courses in blended learning as a

way to dramatically improve the quality of education worldwide, and (b) the
obvious necessity to address the economics of hiring new university teachers to
Developing and Running a Set of Competitive College/University Blended Courses 27

run particular blended courses, both those from abroad and from other universities
in the country, these problems don’t seem to have been studied quantitatively in
scientific publications. As far as the author is aware, only qualitative considerations
of the above two aspects of online education, both in general and with respect
to blended courses, have so far been presented either in the form of case studies
or in that of open discussions. As one can see from the publications cited in the
Introduction, papers covering at least five important aspects of education related to
blended learning—(a) online distance learning advantages in general, (b) blended
courses specifics, (c) the effectiveness of blended learning, (d) peer-effect in blended
learning, and (e) choosing the structure of blended learning—are those on only
qualitative aspects of the corresponding topics.
The aim of the review presented below is to give the reader an impression on
(a) the substance of the topics covered by contemporary publications on blended
learning, and (b) the state of affairs in this field, which has motivated the proposed
quantitative analysis of the problem of using available online courses in designing
blended ones.
Online Distance Learning Problems An important mission of online education—
which is close to the topic of the present paper—consists of providing opportunities
for bringing a high quality of education offered by the most prestigious col-
leges/universities in the world to colleges/universities of a (currently) modest level
of education quality. By reducing their expenses, the latter colleges/universities
may be able to make a higher level of education quality financially affordable
to both them and their students by widely using online courses developed by
leading colleges/universities in the world [3], particularly, by incorporating recorded
fragments from these online courses into blended courses that they can offer. To
implement this mission, as well as other social missions of online education, two
groups of problems associated with distance learning are to be studied.
The first group of these problems includes those associated with “distance learn-
ers” such as communications among online learners [2], approaches to structuring
assessments of studying parts of an online course, which affect student’s learning
strategies [22], creating appropriate social environments affecting the motivation
to learn [23] and encouraging the presence of the so-called “massive learning”
phenomenon [24], choosing the structure of instructions provided in all forms of
online learning [25], along with teacher-student relationships that affect student’s
achievements [26], the effectiveness of online learning in the form of discussions
vs. so-called “silent learning” [27], pedagogical aspects of the knowledge perception
by the learners and the behavior of participants of the online education, along with
micro- and macro-level environment surrounding this activity [28].
The second group of the problems includes those associated with teachers
and administrations of colleges/universities involved in running distance learning
courses such as preparing teachers for running online courses [29], encouraging
teachers to switch to teaching online [30] or/and to using online materials in blended
educational courses, particularly, by providing information on the effectiveness
of online education [31] and comparing this effectiveness with that of traditional
28 A. S. Belenky

learning [32], along with examples of successfully substituting off-line classes

with distance learning [33], describing principal advantages of online learning
[34], designing learning communities [35], and providing accessibility to online
education to people with disabilities [36].
Both groups of the above problems are studied in the framework of surveys of
answers to some questionnaires [37–40] or case studies [22, 23, 36, 41].
Blended Courses Specifics A comparison of the students’ performance in tra-
ditional studies and in blended learning is offered in [4], where the authors,
particularly, suggest that students with high great point averages in prior studies
achieve better results in studying blended courses. Based upon reports of 74 second-
year students on their self-control and self-regulated learning skills, the authors
of [5] conclude that these skills, along with the actual participation in (attending)
the course, are key factors to predict the final grades for an online course. In [6],
predicting the final outcome of studying a blended course and detecting students
who are at risk of poorly studying a particular course is studied by considering 29
“usage variables”. The author of [6] suggests that only 14 of these variables turned
out to be significant, including four variables that allowed one to predict the learning
outcome with a reasonable accuracy. An analysis of the blended learning impact on
studying both STEM and non-STEM disciplines, presented in [42], suggests that
studying STEM disciplines in the framework of blended learning is more effective
than studying them in the framework of traditional studies. At the same time, a
comparison of exam and quiz results in blended and traditional courses, presented
in [7] for a set of introductory economic courses, did not discover substantial
advantages of blended learning.
Eight problems that the teachers who conduct blended courses face are described
in [43] as those being part of three inductive categories—instructional processes,
community concerns, and technical issues—detected as a result of processing the
response of 117 teachers from four Universities in Turkey to questions they were
asked in the interviews. Results of a study reported in [44] suggest that only
a minority of all the students use particular tools in studying blended courses
supported by content management systems (CMS), and the students regulate the
use of these tools as the blended course unfolds. Students’ approaches to inquiry
and collaborations in the course of working on a blended course are studied in
[45], and a division of the students into subgroups within a group of more than 200
students, with similar approaches to the inquiries and learning technologies within
each subgroup, is reported there with respect to a particular one-semester blended
course. The authors of this study believe that such a division helps explain why
students from some of the subgroups are more successful in studying the course
than students from the other ones.
Four key challenges in designing blended courses—(1) incorporating flexibility,
(2) stimulating interaction, (3) facilitating students’ learning processes, and (4) fos-
tering an emotional learning climate—and approaches to resolving corresponding
problems are outlined in a review of 640 sources and 20 studies, presented in [46].
Developing and Running a Set of Competitive College/University Blended Courses 29

Patterns of the students’ evaluation of and experience with online, blended, and
face-to-face courses are compared in [47].
The Effectiveness of Blended Learning A case study related to implementing a
blended learning approach to developing a course is presented in [48], where
both students’ responses to the blended learning environment and thoughts of
the course author on the requirements that a successful blended course should
meet are discussed. An approach to integrating online and face-to-face learning
with blended learning, which was successfully implemented at Suleyman Demirel
University via a University Learning Management Systems (LMS) with respect to a
particular computer engineering course, is described in [9]. A set of bottlenecks
in contemporary higher education, which is described in [49], is considered by
the authors, particularly, from the viewpoint of the ability of blended courses
to contribute to effectively solving corresponding problems of managing higher
education.
Twenty research studies analyzed in the framework of a review, presented in
[50], let the review authors suggest that there exist two groups of main factors,
each affecting the solving of creative problems in a blended learning environment,
that should be considered. The first group includes four particular factors affecting
the solving process, whereas the remaining five factors, determining the blended
learning environment, form the second group. Results of an experiment with
the MAgAdI on-line system—an adaptive, integrated into the learning process
web environment, developed to support the learning processes in which several
knowledge fields, courses, and teachers get involved—are discussed in [8]. The
authors of [8] assert that the use of that on-line system helps successfully blend
traditional teaching methods with various learning environments.
Peer Effects in Learning Peer effects—as a phenomenon associated with an
affection that the interaction with peers may have on a person’ behavior in a
group learning—have some general features. This is why these effects are currently
studied mostly in general though they may have certain specifics in blended learning
and affect compositions of learners (who take blended courses) to achieve better
academic results.
Fundamentals of peer effects in higher education, including the definition of
this phenomenon, are discussed in [10], where the authors assert that these effects
affect the economics of higher education, mostly since (according to these authors),
they “eliminate awkward anomalies in the institutional behavior of colleges and
universities and in the economic structure of higher education as an industry if they
exist.” Three interaction patterns of peer talks—a tutor-led question-and-answer
pattern, a cumulative-exploratory pattern, and a dispute-exploratory one—within
a particular educational environment in which third-year students are to train first-
year ones are identified in [51]. The authors of that paper believe that using the
identified patterns may improve the effectiveness of studies with respect to all the
subjects of learning.
The underlying mechanism of peer effects is studied in [11] with respect to a
group of first year students for which the academic abilities of different student
30 A. S. Belenky

subgroups are measured and compared. According to the authors of that paper, (a)
male, minority, and low-ability students are affected by their peers the most, and (b)
the transfer of general knowledge within all the subgroups has more effect than that
of specific knowledge.
A theoretical model attempting to explain (and compare) different peer effects
in different fields of studies based upon researching the behavior of the first year
students in a middle-sized public university in Italy is presented in [52]. Peer
effects in students’ academic performance are studied in [53] based upon a set of
publications in the field, and one of the findings, presented in that paper, suggests
that high ability students have higher positive peer effect on other high ability
students at both school and college/university level. At the college/university level,
the findings suggest that the background of roommates affects students’ academic
performance.
Choosing the Structure of a Blended Course An approach to structuring a blended-
learning bachelor program in electrical engineering, designed for a group of students
who study this subject alongside with working (as being employed), is discussed in
[21]. Several options for information exchange such as establishing a connection
with the course of mathematics, retrieving references needed to study the course,
electronic module questionnaires, and a feedback channel functioning during the
whole course are used in structuring the course. A methodology (Q-methodology)
to designing blended education courses is considered in [19] to identify learning
perspectives of the students enrolled, along with peculiarities of their perceptions
of the course, which helps estimate possible student achievements as a result of
studying the course. The author of [19] identifies four perception types and offers
his observations on the academic achievements of the students with some of those
perception types.
A mathematical thinking approach underlying the structure of a blended course
in part of a calculus course related to multivariable calculus is proposed in [54],
where the authors assert that creating a blended learning environment is adequate to
developing mathematical thinking in students. A set of guidelines that are based on
theories of instructional design for writing instructors, offered in [55], are aimed at
supporting effective student learning. The authors of that publication suggest that in
designing the course, the instructors should focus on five issues: (1) a substantiation
of the need for the course, (2) the structure of the audience to learn the course,
(3) advantages of developing the course in an on-line form, (4) basic pedagogical
principles, and (5) available resources.
A concept of blended learning, along with basic elements of its teaching mode
frame, is discussed in [56]. A set of guides, documents, and publications related to
practical aspects of designing blended courses, along with common principles of
designing such courses, an appropriate terminology, and strategies to use to meet
accreditation requirements, is considered in [57]. The authors of that publication
discuss approaches to using online technologies and to the course implementation,
along with the problems and difficulties to bear in mind in designing blended
courses, and outline directions of further research in this field. Detailed instructions
Developing and Running a Set of Competitive College/University Blended Courses 31

on designing blended courses to achieve learning objectives, along with certain

tools and templates, are offered in the books [20] and [58]. The books contain both
theoretical and practical considerations related to designing blended courses and
reflect the experience of their authors in developing such courses and in converting
traditional courses into blended learning ones.
As one can see from the presented review, and as mentioned in the Introduction to
this paper, all the discussions and reasonings in all the above five areas are conduced
at the level of philosophical and pedagogical observations and suggestions. No tools
for either a formalized analysis or a conversion of these discussions and reasonings
into practical activities have so far been proposed.
At the same time, it is clear that the college/university competitiveness in
the market of potential new students cannot be based on such observations and
discussions only. College/university administrations that possess tools for making
strategic decisions based on a formalized financial analysis of which economically
affordable decisions can and which ones should be made will undoubtedly have
substantial competitive advantages in the above-mentioned market.

3 The Statements of the Problems and Their Mathematical

Formulation

A University/College (for the sake of definiteness, a University further in this section

of the paper) intents to start teaching blended courses in one of foreign languages,
for instance, in English, and it’s to decide how to organize this activity. The number
of blended courses to be taught in English equals K, and the number of teachers-
native speakers of English language—can (a) cover all these K courses, and (b) be
hired by the University from abroad—equals (or doesn’t exceed) L. Besides hiring
teachers from abroad to run these K blended courses, the University considers a
possibility to offer to teach some courses from this set of K blended courses or all
the K blended courses to teachers-native speakers of the country’s language who
speak English well. To this end, the University considers potential candidates to
teach blended courses from among those who are currently employed either by the
University or by other universities in the country.
The University plans to buy (or to take from the open sources) recorded online
courses that are taught by distinguished professors from leading universities in
the world and to use these recordings in designing all the above-mentioned K
blended courses, no matter who will finally be chosen to teach these courses. Each
of the teachers (a) invited from abroad, (b) invited from other universities in the
country, and (c) currently employed by the University, who are invited to teach each
particular course from the set of K blended courses, is to teach it in one and the
same manner. That is, she/he is to teach each such course in the form of lectures and
seminars, and these lectures and seminars are to be based on or substantially use the
above-mentioned acquired materials from the recorded online courses.
32 A. S. Belenky

Each of the teachers considered by the University to be invited to teach courses

from the set of K blended courses (from among those who can be chosen
from professors currently working at the University, or can be hired from other
universities in the country, or can be invited from abroad) can teach no more than
a certain number of these blended courses. If a teacher from the University is
assigned to teach some courses from the set of K blended courses, her/his existing
assignments for teaching courses in the country’s language are to be covered by
other teachers either currently working at the University or by those to be hired
from other universities in the country (though not by any of those invited to teach
courses from the set K) on the hourly basis. Each potential candidate to teach any
particular course from the set of K blended courses is tested by experts recognized
by the University. Such experts are to estimate a percentage of the students who are
likely to succeed in studying this course should this candidate be selected to teach
the course.
The University is interested in estimating two (indicated in the Introduction)
numbers associated with organizing and running the above-mentioned K blended
courses:
(1) What is the minimal budget to organize the teaching of these K blended
courses in English to secure a percentage of the University students (who are
to study courses from this set) expected to succeed in studying each particular
course (from these K blended courses) to be not lower than a certain desirable
percentage?
(2) What is the maximal percentage of the students (who are to study courses from
this set) expected to succeed in studying each particular course from these K
blended courses under any particular budget that the University can afford to
spend to organize and run these courses?
All the teachers-native speakers of the country’s language who are potentially
capable of teaching courses in English from the set of K blended courses (both
from the University and from other universities in the country) are called course
developers further in the paper. If any course from the K blended courses is to be
taught by a teacher-native speaker of English (invited from abroad), the students
assigned to study this course are to take (a) an advanced course in English language,
and (b) corresponding tutorials to be prepared to understanding fragments from the
corresponding recorded courses, prior to the commencement of the course.
Let
• K be the number of blended courses that the University plans to teach in English
yearly, say, in the next T years,
• B be the yearly budget allocated by the University to cover the expenses
associated with developing and running the set of K blended courses within T
years,
• .B0 be the cost of recorded online courses that the University acquires to use in
developing the set of K blended courses,
• M be the number of potential course developers who are currently employed by
the University,
Developing and Running a Set of Competitive College/University Blended Courses 33

• R be the number of teachers currently working at other universities in the country

who are interested in working at the University, and who are considered by the
University as potential course developers,
• L be the number of teachers-native speakers of English from abroad who can
cover the needs of the University in teaching courses from the set of K blended
courses and who the University can financially afford to invite,
• .ci be the basic yearly salary of course developer i (who is a teacher currently
employed by the University) and who is chosen (assigned) to teach courses from
the set of K blended courses, .i ∈ 1, M,
• .∇ik be the additional yearly salary of course developer i from the University,
assigned to teach courses from the set of K blended courses, for developing and
teaching course .k, k ∈ 1, K, i ∈ 1, M,
• .br be the basic yearly salary of course developer r from another university in the
country invited to teach courses from the set of K blended courses, .r ∈ 1, R,
• .δrk be the additional yearly salary of developer r, invited from another university
in the country to teach courses from the set of K blended courses, for developing
and teaching course .k, k ∈ 1, K, r ∈ 1, R,
• .gr be the relocation cost associated with hiring course developer r from another
university in the country to teach courses from the set of K blended courses,
.r ∈ 1, R,

• .al be the basic yearly salary of teacher l to be invited from abroad to teach courses
from the set of K blended courses in English, .l ∈ 1, L,
• .lk be the additional yearly salary of teacher l, invited from abroad to teach
courses from the set of K blended courses in English, for teaching course .k, k ∈
1, K, l ∈ 1, L,
• .hl be the relocation cost associated with the invitation of teacher l from abroad
to teach courses from the set of K blended courses in English, .l ∈ 1, L,
• .di be the per hour salary of a (currently employed by the University) teacher, who
is to substitute course developer i (who is assigned to teach courses from the set
of K blended courses) in all the activities associated with teaching the courses
“vacated” by course developer i, .i ∈ 1, M (if there are such course developers),
• .tik be the number of hours per year that course developer i “vacates” (as a result
of switching to teaching course k from the set of K blended courses) that are
to be covered by other teachers (either by those who are currently employed at
the University or by those to be hired from other universities in the country),
.i ∈ 1, M, k ∈ 1, K,

• .αik be the expert estimate of a percentage of the students expected to succeed

in studying blended course k that is taught by course developer i from the
University, .i ∈ 1, M, k ∈ 1, K,
• .βrk be the expert estimate of a percentage of the students expected to succeed
in studying blended course k that is taught by course developer r invited from
another university in the country, .r ∈ 1, R, k ∈ 1, K,
• .γlk be the expert estimate of a percentage of the students expected to succeed
in studying blended course k that is taught by teacher l invited from abroad,
.l ∈ 1, L, k ∈ 1, K,
34 A. S. Belenky

• .ωk be a desirable (targeted) by the University percentage of the students expected

to succeed in studying blended course .k, k ∈ 1, K,
• .uk be a Boolean variable that equals 1 if course k from the set of K blended
courses is to be taught by a teacher invited from abroad (so that a special English
language course and tutorials are to be run by the University for the students who
choose to study blended course .k ∈ 1, K) and equals 0, otherwise,
• .fk be the yearly cost of running the advanced English language course and the
corresponding tutorials for the students who choose to study blended course
.k, k ∈ 1, K,

• .xik be a Boolean variable that equals 1 if course developer i from the University is
assigned to teach blended course .k, i ∈ 1, M, k ∈ 1, K, and equals 0, otherwise,
• .sr be a Boolean variable that equals 1 if course developer r to be invited from
another university in the country is qualified to teach courses from the set of K
blended courses and equals 0, otherwise, .r ∈ 1, R,
• .yrk be a Boolean variable that equals 1 if course developer r from another
university in the country is invited to teach blended course .k, k ∈ 1, K, r ∈ 1, R
and equals 0, otherwise,
• .wl be a Boolean variable that equals 1 if teacher l from abroad is invited to teach
courses from the set of K blended courses and equals 0, otherwise, .l ∈ 1, L,
• .zlk be a Boolean variable that equals 1 if teacher l from abroad is invited to teach
blended course k and equals 0, otherwise, .l ∈ 1, L, k ∈ 1, K,
• .M(i) ⊂ 1, K be a subset of courses from the set of K blended courses that
course developer i (from the University) can’t teach, .i ∈ 1, M,
• .R(r) ⊂ 1, K be a subset of courses from the set of K blended courses that
course developer r (to be invited from another university in the country) can’t
teach, .r ∈ 1, R,
• .L(l) ⊂ 1, K be a subset of courses from the set of K blended courses that
teacher l (to be invited from abroad) can’t teach, .l ∈ 1, L,
• .νi be the maximal number of courses from the set of K blended courses that
course developer i (from the University) can teach concurrently, .i ∈ 1, M,
• .μr be the maximal number of courses from the set of K blended courses that
course developer r (to be invited from another university in the country) can
teach concurrently, .r ∈ 1, R,
• .πl be the number of courses from the set of K blended courses that teacher l (to
be invited from abroad) can teach concurrently, .l ∈ 1, L, and
• .qk be the yearly budget that the University can spend for additional salaries to
be paid to the teacher who teaches course k from the set of K blended courses
(along with the English language course and corresponding tutorials, if need be)
and to the one who substitutes this teacher (if there is one).
Assumptions
1. The numbers of potential candidates from other universities in the country to
choose from to (a) invite to become blended course developers, and (b) replace
teacher i for the hours “vacated” by this teacher are known large integers.
Developing and Running a Set of Competitive College/University Blended Courses 35

Course developers .i, i ∈ 1, M and .j, j ∈ 1, R are paid an extra salary

for developing and running courses from the set of K blended courses. This
developing is done in line with the structure of each blended course they are
invited to teach, which is to be provided by the University administration.
However, the basic salary of developer .i, i ∈ 1, M remains the same as it
was before the assignment to teach blended courses (and it’s not a part of the
budget B, allocated to cover the expenses associated with hiring teachers to teach
courses from the set of K blended courses).
2. Teachers invited from abroad to teach courses from the set of K blended
courses come with already developed such courses (in line with the structure
of these courses to be provided by the University administration in advance).
Nevertheless, they are paid extra salaries .∇lk , l ∈ 1, L, k ∈ 1, K for teaching
these courses while their basic salaries are specified by corresponding contracts
sighed between each of them and the University administration.
All the salaries paid to each teacher invited by the University—both to each
of those invited from other universities in the country and to each of those
invited from abroad—are part of the budget B, allocated to cover the expenses
associated with hiring teachers to develop and run courses from the set of K
blended courses.
3. For the sake of simplicity, it’s further assumed that both the tutorials and special
English language courses are run for blended course k only if this course is to be
taught by a teacher invited from abroad (though, generally, at least the tutorials
may be run if a teacher of any of the two other kinds is to run this course),
introducing Boolean variables similar to .uk may reflect this case in the system of
constraints (1), (2), to be presented further in this paper).
4. All the contracts with every teacher invited from abroad to teach courses from
the set of K blended courses and with every blended course developer from other
universities in the country, hired by the University to teach blended courses from
this set, are signed for T years.
5. In considering the use of recorded fragments of online lectures in designing
blended courses, along with the invitation of outside teachers to teach blended
courses (including those invited from abroad), the University administration
chooses a total set of blended courses that it plans to start offering at the
University yearly within T years so that no new blended courses are added by
the University within these T years.
6. All the information needed for estimating the values of the parameters .αik ,
i ∈ 1, M, k ∈ 1, k, .βrk , r ∈ 1, R, k ∈ 1, k, and .γlk , l ∈ 1, L, k ∈ 1, k
is considered to be available with respect to all the teachers invited to teach
blended courses from the set K at the University. Teachers from abroad who
are considered to be invited to teach blended courses in line with the required
structure of these courses (to be provided by the University administration)
are asked to provide some of their recorded lectures on the same subjects in
advance, to let the University administration and the experts make corresponding
estimates. To make these estimates, the University arranges testing of the
36 A. S. Belenky

provided recorded materials on a selected group of its students to find out what
kinds of training courses and tutorials are to be run to help the students better
understand both the substance of the blended courses and the language to be
used in presenting (the designed) blended courses to the audience, which these
courses are prepared for.

The systems of constraints

M
R
L
xik + yrk + zlk = 1, k ∈ 1, K,
i=1 r=1 l=1

M
K
xik ≤ 1, k ∈ 1, K, xik ≤ νi , i ∈ 1, M,
i=1 k=1

xik = 0, k ∈ M(i) ⊂ 1, K,

K
yrk ≤ sr ≤ yrk , r ∈ 1, R, k ∈ 1, K,
k=1

R
K
yrk ≤ 1, k ∈ 1, K, yrk ≤ μr , r ∈ 1, R,
r=1 k=1

yrk = 0, k ∈ R(r) ⊂ 1, K,
.
L (1)
zlk − uk = 0, k ∈ 1, K,
l=1

K
zlk ≤ wl ≤ zlk , l ∈ 1, L, k ∈ 1, K,
k=1

L
K
zlk ≤ 1, k ∈ 1, K, zlk ≤ πl , l ∈ 1, L,
l=1 k=1

zlk = 0, k ∈ L(l) ⊂ 1, K,

L
R
lk zlk + δrk yrk +
l=1 r=1

M
M
∇ik xik + di tik xik + fk uk ≤ qk , k ∈ 1, K,
i=1 i=1

and
Developing and Running a Set of Competitive College/University Blended Courses 37

L
K
L
R
K
R
wl (al + hl ) + lk zlk + sr (br + gr ) + δrk yrk +
l=1 k=1 l=1 r=1 k=1 r=1
. (2)

K
M
K
M
K
∇ik xik + di tik xik + fk uk ≤ B − B0
k=1 i=1 k=1 i=1 k=1

should hold for the first year in the set of T years (also, see concluding Remark 9).
Taking into account the system of constraints (1) and (2), the two problems stated
earlier in this section of the paper can mathematically be formulated as follows:
Problem 1 The problem consists of maximizing the minimum percentage of the
total number of students expected to succeed in studying a course from the set .1, K
(when these courses are considered to be equally important from the University’s
viewpoint), and this problem is formulated proceeding from the fixed yearly budget
B (for T consecutive years). This problem is the Boolean programming problem

M
R
L
. min αik xik + βrk yrk + γlk zlk → max (3)
k∈1,K (xik ,yrk ,zlk ,uk ,sr ,wl )
i=1 r=1 l=1

under the systems of constraints (1), (2).

Problem 2 The problem consists of minimizing the maximal value of the yearly
budget for organizing and running courses from the set of K blended courses (which
is that for the first of the T consecutive years), provided the desirable (targeted)
percentages of the students expected to succeed in studying the courses from this
set (determined by the numbers .ωk , k ∈ 1, K) are attained. This problem is the
Boolean programming problem

L
K
L
R
K
R
wl (al + hl ) + kl zkl + sr (br + gr ) + δkr ykr +
l=1 k=1 l=1 r=1 k=1 r=1
. (4)

K
M
K
M
K
∇ik xik + di tik xik + fk uk → min
(xik ,yrk ,zlk ,uk ,sr ,wl )
k=1 i=1 k=1 i=1 k=1

under the system of constraints (1) and the additional system of constraints

M
R
L
. αik xik + βrk yrk + γlk zlk ≥ ωk , k ∈ 1, K. (5)
i=1 r=1 l=1

In both problems, it is assumed that the systems of constraints (1), (2) (in
Problem 1) and (1), (2), (5) (in Problem 2) are compatible, which can be verified,
and appropriate corrections in the systems of constraints (1), (2) can be made with
the use of the technique proposed, in particular, in [59].
38 A. S. Belenky

In Problems 1 and 2, it is assumed that the parameters .αik , βrk , and .γlk —which
are expert estimates of percentages of the students who are to take course .k ∈ K and
are expected to succeed in studying this course if the course is taught by a developer
from the University, or by a developer from another university in the country, or
by a teacher invited from abroad, respectively—are known numbers provided to the
University administration by expects recognized by the University.

4 Concluding Remarks

1. It’s clear that, generally, Boolean programming problems (1)–(3) and (1), (2),
(4), (5) can also serve as mathematical formulations of problems associated with
finding an optimal composition of the teachers to run any set of offline courses at
a college/university, for instance, in physics, mathematics, biology, etc., without
the use of recordings of any lectures on corresponding subjects. If this is the
case, any training courses and tutorials in the corresponding subjects may or
may not be run, and teachers from abroad may or may not be invited to teach the
corresponding offline courses. To formulate both problems in this case, some
variables and parameters, being present in the system of constraints (1), (2),
should be omitted.
Such a possibility to use the proposed models and Boolean programming
problems stems from the obvious observation: The nature of stuffing a set of
blended courses with teachers and that of stuffing a set of offline courses with
ones is the same, and so should be their mathematical formalization. Indeed, in
both cases, an optimal combination of the teachers from any available set of them
to teach courses within certain financial limits should be determined. However,
stuffing offline courses at, for instance, a modest university differs from stuffing
blended ones there.
Indeed, to make blended courses attractive to potential new students, the
college/university administration should
(a) properly advertise these courses as those to be taught with the use of
fragments of lectures given by professors from famous universities in the
world,
(b) convince the potential new students that both the tutorials and the training
courses to accompany each offered blended course (if there are ones to
accompany it) are designed to let every interested student succeed in studying
this course,
(c) pay extra salaries to the chosen teachers for developing each blended
course (in line with its structure to be provided by the college/unversity
administration), and
(d) apply a reliable procedure of estimating the ability of each potential teacher
to teach a blended course
Developing and Running a Set of Competitive College/University Blended Courses 39

to let the expert estimates of the percentage of the students expected to succeed
in studying this course be not lower than a particular desirable number (which is
known for standard offline courses on the same subjects).
Finally, the blended courses to be included in the college/university curricula
should be designed and run in such a manner that the students who are the
most successful in studying the offered blended courses may eventually become
prepared to continue their education in those world-famous universities, whose
recorded fragments of lectures were part of these blended courses. For instance,
upon graduation, bachelor students may succeed in getting admitted to these
universities for master programs, whereas master students may get admitted
for Ph.D. studies there. Needless to say that if this happened, it would be an
excellent contribution to the university’s prestige and would sharply contrasted
the described strategy of designing and running blended courses from that
of running traditional offline ones. The college/university may even organize
subgroups of the most advanced enrolled students interested in making these
moves to the world-leading universities to help them succeed in implementing
their desire. This can be done, particularly, by offering them specially designed
blended courses with a more intensive use of the recorded online lectures given
by distinguished professors at those universities.
2. While designing blended courses based on fragments of recorded lectures of
distinguished lecturers from world-leading universities is a good option for
advanced potential and current college/university students, blended courses for
both less prepared potential and current students may use recordings of ordinary
lecturers, including those offered by college/university teachers. In both cases,
problems (1)–(3) and (1), (2), (4), (5) (with changes reflecting a particular
strategy of designing and running blended courses) can be used for receiving
corresponding quantitative financial estimates.
3. The formulations of Problems 1 and 2 suggest that the proposed decision support
tool may help a college/university to substantiate the budget needs in talks with
legal entities that care about the quality of education that this college/university
provides and its prestige (such as local and federal administrations for public
colleges/universities and contributing sponsors for private ones). Particularly,
this tool helps the college/university substantiate the size of B − B0 —the
college/university’s budget that is planned by the college/university to be spend
for organizing and running K blended courses—particularly, in the talks with
these legal entities associated with the competitiveness of the college/university
in the market of new potential students.
4. In this paper, the competitiveness of a college/university in a market of potential
students is considered with respect to new potential students only. However, it’s
clear that, generally, the competitiveness of a college/university also depends
on how the education process is arranged for the students who have been with
this college/university at least for some time. Appropriate arrangements motivate
such students to complete their education where they are currently enrolled
rather than encourage them to transfer to another college/university to receive a
corresponding degree there. With respect to blended courses, which can be either
40 A. S. Belenky

core ones or electives, the same features that attract the attention of potential
new students should be present in the blended courses offered to already enrolled
college/university students. Also, one should bear in mind that the quality of
teaching these courses affects the intents of new potential students to enroll at
the college/university, since such intents are partly motivated by the information
on this quality received from its current students.
5. The proposed scheme of using fragments from recorded online courses is only
one of the options to use such recordings. Since most of these courses are
available free of charge, the college/university administration may decide to use
online courses as a substitution for offline ones, without designing and running
new (blended) courses. From the author’s teaching experience, it seems hard to
believe that the recorded online lectures can always substitute the presence of a
lecturer live in auditorium for a majority of the students studying these courses
while (at least) not positively affecting the percentage of the students who are
likely to study corresponding online courses successfully. In any case, it looks
reasonable to “test” both approaches to using the recorded online courses before
making a final decision on this matter.
6. Organizing tutorials to prepare particular groups of college/university students
for successfully studying new blended courses is always a challenge. The major
problem here is associated with timely preparing the students to (a) understand
the facts to be presented in a particular blended course, and (b) be comfortable
with the style of the material presentation by the teacher chosen to run this
blended course. The time left for this preparation much depends on when the
selection of the teachers to run the courses is completed, and the results of
providing corresponding tutorials substantially depend on who runs the tutorials
for each new course from the set of K blended courses. The latter may become a
complicated problem when the teacher chosen to run a particular blended course
is not currently employed at the college/university.
Also, one needs to emphasize that a teacher chosen to teach a blended
course at a college/university is free to develop these courses on her/his own,
as long as she/he adheres to the structure of the course determined by the col-
lege/university administration. (However, discussing approaches to developing
particular blended courses in line with their assigned structures lies beyond the
scope of this paper.)
Finally, in the model (1), (2), it was assumed that the tutorials and English
language courses are organized and run only for blended courses to be taught by
the teachers invited from abroad. However, generally, this may not be the case.
Both kinds of course developers (from the University and from other universities
in the country) may require to run both tutorials for the blended courses to be
taught by them and the English language courses to be offered by corresponding
departments at the University.
Let M̃ ⊂ M and R̃ ⊂ R be subsets of those course developers who require
to organize and run tutorials (with or without additional courses in English
language to be organized by the college/university), let f˜ikU and f˜rk OU be the
Developing and Running a Set of Competitive College/University Blended Courses 41

costs or running these tutorials, and let ψ̃ikU and ψ̃ OU be the costs or running
rk
additional English language courses, requested by the developers from the
college/university and by those from other colleges/universities in the country,
respectively. Then, for instance, the first (of the T ) inequalities in (2) will take
the form

L
K
L
R
K
R
wl (al + hl ) + lk zlk + sr (br + gr ) + δrk yrk +
l=1 k=1 l=1 r=1 k=1 r=1

K
M
K
M
K
K
. ∇ik xik + di tik xik + fk uk + f˜ikU xik (6)
k=1 i=1 k=1 i=1 k=1 i∈M̃ k=1

K
K
K
+ f˜rk
OU
yrk + U
ψ̃ik xik + OU
ψ̃rk yrk ≤ B − B0 .
r∈R̃ k=1 i∈M̃ k=1 r∈R̃ k=1

Here, it’s assumed that if developer i ∗ from the set M̃ doesn’t require to
organize and run tutorials for blended course k ∗ ∈ 1, K, the coefficient f˜iU∗ k ∗
equals zero. The same assumption holds for the coefficient f˜rOU ∗∗ k ∗∗ (if developer
∗∗
r from the set R̃ doesn’t require to organize and run tutorials for blended course
k ∗∗ , k ∗∗ ∈ 1, K) and for the coefficients ψ̃iU∗ k ∗ and ψ̃rOU
∗∗ k ∗∗ .
7. Generally, the college/university administration may exercise two approaches to
using recorded fragments of online lectures in designing blended courses, which
differ in the budget for running these courses. That is, it can offer the same set
of blended courses on a yearly basis within, say, T years so that no new blended
courses are added within these T years (see Assumption 5 from Sect. 3). The
other one implies widening the spectrum of blended courses yearly (or even more
often) depending on the results at the end of a particular year (or even on those of
each year), including financial results associated with running blended courses.
One can easily be certain that the proposed model can be used in formalizing
corresponding mathematical problems in the framework of the second approach
though the structure of the system of constraints in the corresponding problems
will be different. That is, this system will have a block-diagonal structure binding
variables related to each particular year (during T years) within a separate block,
along with a block binding all the variables of the system of constraints [60].
8. The proposed tool demonstrates the potential of an adequate mathematical
modelling, systems analysis, operations research techniques, and standard opti-
mization software in solving practical problems arising in the economics of
education.
9. One should bear in mind that if inequality (2) holds for the first year in the set
of T years, the restriction for a yearly budget for all the K courses will also
hold for all the other years from this set (since all the expenses associated with
the relocation of the invited teachers, which are reflected in (2), as well as the
expenses B0 , take place in this first year).
42 A. S. Belenky

10. In the models underlying the formulations of Problems 1 and 2, it’s assumed
that both the set of potential course developers from other colleges/universities
in the country and that from those to be invited from abroad are known to
the college/university administration in advance, before these two problems
are solved. Also, it’s assumed that both the tutorials (for each special course)
and the English language courses, provided by the teachers, are taken by the
students during the period of T years only once. Finally, it’s assumed that any
substitution of the teachers from other colleges/universities in the country and
from abroad for the years from the set of T years can be done only by the
college/university teachers (unless the relocation expenses of an invited teacher,
not currently employed by the college/university, are covered by it), and the
values ci , i ∈ 1, M, are taken into account.
One of further research directions of studying both problems considered in this
paper should concern the development of approaches to solving these problems
under uncertainty conditions. This uncertainty is associated with the impossibility
to determine the exact values of expert estimates of the parameters αik , βrk , γlk —
percentages of the students expected to succeed in studying each of K blended
courses with respect to every teacher considered by the college/university admin-
istration as a candidate for teaching this course—in principle. Though the proposed
tool lets the college/university administration conduct multiple calculations for any
number of sets of these expert estimates that the administration may be interested to
explore, one should understand that information on the values of these estimates is
that of a probabilistic nature. Moreover, establishing any particular regularities and
parameters of distribution laws describing the dynamics of these parameters seems
to present considerable difficulties.
At the same time, there exist approaches to treating similar information (on
parameter values in corresponding mathematical models) that let formulate prob-
lems under uncertainty conditions similar to those considered in this paper as linear
or mixed programming ones solving which can be done by using standard software
packages, available even on PCs [61]. However, the applicability of such approaches
to considered problems depends on whether assertions similar to those presented in
[61] can be mathematically proven.

References

1. Kolowich, S.: Exactly how many students take online courses. The Chronicle of Higher
Education, January 16 (2014)
2. Joksimovic, S., Gasevic, D., Loughin, T.M., Kovanovic, V., Hatala, M.: Learning at distance:
effects of interaction traces on academic achievement. Comput. Educ. 87, 204–217 (2015)
3. Carnoy, M., Kuz’minov, Y.: Online learning: how it affects the university structure and
economics. Panel discussion. Educ. Stud. Moscow, 3, 8–43 (2015)
4. Asarta, C.J., Schmidt, J.R.: Comparing student performance in blended and traditional courses:
does prior academic achievement matter? Internet High. Educ. 32, 29–38 (2017)
Developing and Running a Set of Competitive College/University Blended Courses 43

5. Zhu, Y., Au, W., Yates, G.: University students’ self-control and self-regulated learning in a
blended course. Internet High. Educ. 30, 54–62 (2016)
6. Zacharis, N.Z.: A multivariate approach to predicting student outcomes in web-enabled
blended learning courses. Internet High. Educ. 27, 44–53 (2015)
7. Olitsky, N.H., Cosgrove, S.B.: The effect of blended courses on student learning: evidence
from introductory economics courses. Int. Rev. Econ. Educ. 15, 17–31 (2015)
8. Alvarez, A., Martin, M., Fernandez-Castro, I., Urretavizcaya, M.: Blending traditional teaching
methods with learning environments: Experience, cyclical evaluation process and impact with
MAgAdI. Comput. Educ. 68, 129–140 (2013)
9. Yigit, T., Koyun, A., Yuksel, A.S., Cankaya, I.A.: Evaluation of blended learning approach in
computer engineering education. Procedia Soc. Behav. Sci. 141, 807–812 (2014)
10. Winston, G.C., Zimmerman, D.J.: Peer effects in higher education. In: The Economics of
Where to Go, When to go, and How to Pay for it, pp. 395–421. University of Chicago Press
(2004)
11. Griffith, A.L., Rask, K.N.: Peer effects in higher education: a look at heterogeneous impacts.
Econ. Educ. Rev. 39, 65–77 (2014)
12. Bazerman, M., Moore, D.: Judgment in Managerial Decision Making, 8th edn. Wiley (2012)
13. Gilboa, I.: Making Better Decisions: Decision Theory in Practice, 1st edn. (Wiley/Blackwell,
2010)
14. Saaty, T.: Decision Making for Leaders: The Analytic Hierarchy Process for Decisions in a
Complex World, 3rd Revised edn. RWS Publications (2012)
15. Noyes, J., Cook, M.: Decision Making in Complex Environments, 1st edn. CRC Press (2017)
16. de Haan, A., de Heer, P.: Solving Complex Problems: Professional Group Decision-Making
Support in Highly Complex Situations, 2nd Revised edn. Eleven International Publishing
(2015)
17. Hammond, J., Keeney, R., Raiffa, H.: Smart Choices: A Practical Guide to Making Better
Decisions. Crown Business (2002)
18. Adair, J.: Decision Making and Problem Solving (Creating Success). Kogan Page (2013)
19. Kim, J.Y.: A study on learners’ perceptional typology and relationships among the learner’s
types, characteristics, and academic achievement in a blended e-education environment.
Comput. Educ. 59(2), 304–315 (2012)
20. Linder, K.E.: The Blended Course Design Workbook: A Practical Guide, Workbook edn.
Stylus Publishing (2016)
21. Kalberer, N., Petendra, B., Bohmer, C., Schibelbein, A., Beck-Meuth, E.M.: Evaluation process
and quality management in a blended-learning bachelor’s programme. Procedia Soc. Behav.
Sci. 228, 131–137 (2016)
22. Zlatovic, M., Balaban, I., Kermek, D.: Using online assessments to stimulate learning strategies
and achievement of learning goals. Comput. Educ. 91, 32–45 (2015)
23. Barak, M., Watted, A., Haick, H.: Motivation to learn in massive open online courses:
examining aspects of language and social engagement. Comput. Educ. 94(3), 49–80 (2016)
24. Sendra-Portero, F., Torales-Chaparro, O.E., Ruiz-Gomez, M.J., Martinez-Morillo, M.: A pilot
study to evaluate the use of virtual lectures for undergraduate radiology teaching. Eur. J. Radiol.
82(5), 888–893 (2013)
25. Yilmaz, O.: The effects of “live online courses” on students’ achievement at distance education.
Procedia Soc. Behav. Sci. 55(5), 347–354 (2012)
26. Song, H., Kim, J., Luo, W.: Teacher-student relationship in online classes: a role of teacher
self-disclosure. Comput. Hum. Behav. 54, 436–443 (2016)
27. Shukor, N.A., Tasir, Z., der Meijden, H.V.: An examination of online learning effectiveness
using data mining. Procedia Soc. Behav. Sci. 172, 555–562 (2015)
28. Cho, Y.H., Choi, H., Shin, J., Yu, H.C., Kim, Y.K., Kim, J.Y.: Review of research on online
learning environments in higher education. Procedia Soc. Behav. Sci. 191(2), 2012–2017
(2015)
29. Beach, P.: Self-directed online learning: a theoretical model for understanding elementary
teachers’ online learning experiences. Teach. Teacher Educ. 61, 60–72 (2017)
44 A. S. Belenky

30. Rienties, B., Brouwer, N.Z., Lygo-Baker, S.: The effects of online professional development on
higher education teachers’ beliefs and intentions towards learning facilitation and technology.
Teach. Teacher Educ. 29(1), 122–133 (2013)
31. Karaman, S., Kucuk, S., Aydemir, M.: Evaluation of an online continuing education program
from the perspective of new graduate nurses. Nurse Educ. Today 34(5), 836–841 (2014)
32. Abdelaziz, M., Kamel, S.S., Karam, O., Abdelrahman, A.: Evaluation of e-learning program
versus traditional lecture instruction for undergraduate nursing students in a faculty of nursing.
Teach. Learn. Nurs. 6(2), 50–58 (2011)
33. Pei, L., Wu, H.: Does online learning work better than offline learning in undergraduate medical
education? A systematic review and meta-analysis. Med. Educ. Online 24(1), 1–13 (2019)
34. Mubarak, A., Al-Arimi, A.K.: Distance learning. Procedia Soc. Behav. Sci. 152, 82–88 (2014)
35. Tsiotakis, P., Jimoyiannis, A.: Critical factors towards analyzing teachers’ presence in on-line
learning communities. Internet High. Educ. 28(1), 45–58 (2016)
36. van Rooij, S.W., Zirkle, K.: Balancing pedagogy, student readiness and accessibility: a case
study in collaborative online course development. Internet High. Educ. 28(1), 1–7 (2016)
37. Markova, T., Glazkova, I., Zaborova, E.: Quality issues of online distance learning. Procedia
Soc. Behav. Sci. 237, 685–691 (2017)
38. Cohen, A., Baruth, O.: Personality, learning, and satisfaction in fully online academic courses.
Comput. Hum. Behav. 72, 1–12 (2017)
39. Boling, E.C., Hough, M., Krinsky, H., Saleem, H., Stevens, M.: Cutting the distance in distance
education: perspectives on what promotes positive, online learning experiences. Internet High.
Educ. 15(2), 118–126 (2012)
40. Moore, J.L., Dickson-Deane, C., Galyen, K.: E-learning, online learning, and distance learning
environments: are they the same? Internet High. Educ. 14(2), 129–135 (2011)
41. Sanders-Smith, S., Smith-Bonahue, T., Soutullo, O.: Practicing teachers’ responses to case
method of instruction in an online graduate course. Teach. Teacher Educ. 54(2), 1–11 (2016)
42. Vo, H.M., Zhu, C., Diep, N.A.: The effect of blended learning on student performance at
course-level in higher education: a meta-analysis. Stud. Educ. Eval. 53, 17–28 (2017)
43. Ocak, M.A.: Why are faculty members not teaching blended courses? Insights from faculty
members. Comput. Educ. 56(3), 689–693 (2011)
44. Lust, G., Elen, J., Clarebout, G.: Regulation of tool-use within a blended course: Student
differences and performance effects. Comput. Educ. 60(1), 385–395 (2013)
45. Ellis, R.A., Pardo, A., Han, F.: Quality in blended learning environments - significant
differences in how students approach learning collaborations. Comput. Educ. 102, 90–102
(2016)
46. Boelens, R., De Wever, B., Voet, M.: Four key challenges to the design of blended learning: a
systematic literature review. Educ. Res. Rev. 22, 1–18 (2017)
47. Dziuban, C., Moskal, P.: A course is a course is a course: factor invariance in student evaluation
of online, blended and face-to-face learning environments. Internet High. Educ. 14(4), 236–
241 (2011)
48. Nazarenko, A.L.: Blended learning vs traditional learning: what works? (a case study research).
Procedia Soc. Behav. Sci. 200, 77–82 (2015)
49. Slechtova P., Hana, V., Jan, V.: Blended learning: promising strategic alternative in higher
education. Procedia Soc. Behav. Sci. 171, 1245–1254 (2015)
50. Sophonhiranrak, S., Suwannatthachote, P., Ngudgratoke, S.: Factors affecting creative problem
solving in the blended learning environment: a review of the literature. Procedia Soc. Behav.
Sci. 174, 2130–2136 (2015)
51. Havnes, A., Christiansen, B., Bjork, I.T., Hessevaagbakke, E.: Peer learning in higher educa-
tion: patterns of talk and interaction in skills centre simulation. Learn. Cult. Soc. Interaction 8,
75–87 (2016)
52. Brunello, G., De Paola, M., Scoppa, V.: Residential peer effects in higher education: does the
field of study matter? Econ. Enquiry 48(3), 621–634 (2010)
Developing and Running a Set of Competitive College/University Blended Courses 45

53. Sacerdote, B.: Peer effects in education: How might they work, how big are they, and how
much do we know thus far? In: Handbook of the Economics of Education, vol. 3, pp. 249–277
(2010)
54. Kashefi, H., Ismail, Z., Yusof, Y.M.: Overcoming students obstacles in multivariable calculus
through blended learning: a mathematical thinking approach. Procedia Soc. Behav. Sci. 56,
579–586 (2012)
55. Savenye, W.C., Olina, Z., Niemczyk, M.: So you are going to be an online writing instructor:
issues in designing, developing, and delivering an online course. Comput. Compos. 18(4),
371–385 (2001)
56. Mo, Z.: Research on the teaching model application tendency of blended learning in the
displayed design course. IERI Procedia 2, 204–208 (2012)
57. McGee, P., Reis, A.: Blended course design: a synthesis of best practices. J. Asynchronous
Learn. Netw. 16(4), 7–22 (2012)
58. Vai, M., Sosulski, K.: Essentials of Online Course Design: A Standards-Based Guide (Essen-
tials of Online Learning), 2nd edn. Routledge (2015)
59. Belenky, A.: Analyzing the potential of a firm: an operations research approach, Autom.
Remote Control 35(13), 1405–1424 (2002)
60. Hu, T.C., Kahng, A.B.: Linear and Integer Programming Made Easy, 1st edn. Springer (2016)
61. Belenky, A.: Two classes of games on polyhedral sets in systems economic studies. In: Network
Models in Economics and Finance, pp. 35–80. Springer (2014)
SARAH-Based Variance-Reduced
Algorithm for Stochastic Finite-Sum
Cocoercive Variational Inequalities

Aleksandr Beznosikov and Alexander Gasnikov

1 Introduction

In this paper we focus on the following unconstrained variational inequality (VI)

problem:

Find z∗ ∈ Rd such that F (z∗ ) = 0,

. (1)

where .F : Rd → Rd is some operator. This formulation is broad and encompasses

many popular classes of tasks arising in practice. The simplest, however, widely
encountered example of the VI is the minimization problem:

. min f (z).
z∈Rd

To represent it in the form (1), it is sufficient to take .F (z) = ∇f (z). As another also
popular practical example, we can consider a saddle point or min-max problem:

. min max g(x, y).

x∈Rdx y∈Rdy

Here we need to take .F (z) = [∇x g(x, y), −∇y g(x, y)].

A. Beznosikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
HSE University, Moscow, Russia
A. Gasnikov ()
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
IITP RAS, Moscow, Russia
Caucasus Mathematical Center, Adyghe State University, Maikop, Russia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 47

From a machine learning perspective, it is interesting not the deterministic

formulation (1), but the stochastic one. More
specifically,
we want to consider the
setup with the operator .F (z) = Eξ ∼D Fξ (z) , where .ξ is a random variable, .D
is some distribution, .Fξ : Rd → Rd is a stochastic operator. But it is often the
case (especially in practical problems) that the distribution .D is unknown, but we
have some samples from .D. Then, one can replace with a finite-sum Monte Carlo
approximation, i.e.

1
n
F (z) =
. Fi (z). (2)
n
i=1

In the case of minimization problems, statements of the form (1) + (2) are also called
empirical risk minimization [37]. These types of problems arise both in classical
machine learning problems such as simple regressions and in complex, large-scale
problems such as neural networks [23]. When it comes to saddle point problems, in
recent times the so-called adversarial approach has become popular. Here one can
highlight Generative Adversarial Networks (GANs) [13] and the adversarial training
of models [25, 40].
Based on the examples mentioned above, it can be noted that for operators of the
form (2), computing the full value of F is a possible but not desirable operation,
since it is typically very expensive compared to computing a single operator .Fi .
Therefore, when constructing an algorithm for the problem (1) + (2), one wants to
avoid computing (or compute very rarely) the full F operator. This task can be
solved by a stochastic gradient descent (SGD) framework. Currently, stochastic
methods for minimization problems already have a huge background [14]. The first
methods of this type were proposed back in the 1950s by Robbins and Monro [35].
For example, in the most classic variant, SGD could be written as follows:

zk+1 = zk − ηv k ,
. (3)

where .η > 0 is a predefined step-size and .v k = ∇fi (zk ), where .i ∈ [n] is

chosen randomly [38]. In this case, the variance of .vt is the main source of slower
convergence or convergence only to the neighbourhood of the solution [7, 21, 31].
But for minimization problems of the finite-sum type, one can achieve stronger
theoretical and practical results compared to the method (3). This requires the use of
a variance reduction technique. Recently, many variance-reduced variants of SGD
have been proposed, including SAG/SAGA [10, 34, 36], SVRG [3, 19, 39], MISO
[27], SARAH [18, 29, 30, 32], SPIDER [11], STORM [9], PAGE [24]. The essence
of one of the earliest and best known variance-reduced methods SVRG is to use
.v = ∇fi (z ) − ∇fi (z̃) + ∇f (z̃), where .i ∈ [n] is picked at random, where .i ∈
k k

[n] is picked at random and the point .z̃ is updated very rarely (hence we do not
need to compute the full gradient often). With this type of methods it is possible to
achieve a linear convergence to the solution. But for both convex and non-convex
SARAH for Stochastic Cocoercive Variational Inequalities 49

smooth minimization problems, the best theoretical guarantees of convergence are

given by other variance-reduced technique SARAH (and its modifications: SPIDER,
STORM, PAGE).
In turn, stochastic methods are also investigated for variational inequalities and
saddle point problems [4–6, 12, 15–17, 20, 28], including methods based on variance
reduction techniques [1, 2, 5, 6, 8, 22, 33]. Most of these methods are based on the
SVRG approach. At the same time, SARAH-based methods have not been explored
for VIs. But as we noted earlier, these methods are the most attractive from the
theoretical point of view for minimization problems. The purpose of this paper is to
partially close the question of SARAH approach for stochastic finite-sum variational
inequalities.

2 Problem Setup and Assumptions

Notation We use .x, y := ni=1 xi yi to denote standard inner product of .x, y ∈
Rd where .xi corresponds to the i-th component of x in
√ the standard basis in .R . It
d

induces .2 -norm in .R in the following way . x 2 := x, x.

Recall that we consider the problem (1), where the operator F has the form (2).
Additionally, we assume
Assumption 1 (Cocoercivity) Each operator Fi is -cocoercive, i.e. for all u, v ∈
Rd we have

. Fi (u) − Fi (v) 2
≤ Fi (u) − Fi (v), u − v. (4)

This assumption is somehow a more restricted analogue of the Lipschitzness

of Fi . For convex minimization problems, -Lipschitzness and -cocoercivity are
equivalent. Regarding variational inequalities and saddle point problems, see [26].
Assumption 2 (Strong Monotonicity) The operator F is μ-strongly monotone,
i.e. for all u, v ∈ Rd we have

F (u) − F (v); u − v ≥ μ u − v 2 .
. (5)

For minimization problems this property means strong convexity, and for saddle
point problems strong convexity–strong concavity.

3 Main Part

For general Lipschitzness variational inequalities, stochastic methods are usually

based not on SGD, but on the Stochastic Extra Gradient method [20]. But due to the
50 A. Beznosikov and A. Gasnikov

fact that we consider cocoercive VIs, it is sufficient to look at SGD like methods for
this class of problems. For example, [26] considers SGD, [6]—SVRG. Following
this reasoning, we base our method on the original SARAH [29].

Algorithm 1 SARAH [29] for stochastic cocoercive variational inequalities

1: Parameters: Stepsize γ > 0, number of iterations K, S.
2: Initialization: Choose z̃0 ∈ Rd .
3: for s = 1, 2, . . . , S do
4: z0 = z̃s−1
5: v 0 = F (z0 )
6: z1 = z0 − γ v 0
7: for k = 1, 2, . . . , K − 1 do
8: Sample ik independently and uniformly from [n]
9: v k = Fik (zk ) − Fik (zk−1 ) + v k−1
10: zk+1 = zk − γ v k
11: end for
12: z̃s = zK
13: end for

Next, we analyse the convergence of this method. Note that we will use the vector
v K in the analysis, but in reality this vector is not calculated by the algorithm. Our
.

proof are heavily based on the original work on SARAH [29]. Lemma 1 gives an
understanding of how . v k 2 behaves during the internal loop of Algorithm 1.
Lemma 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algorithm 1)
with .γ ≤ 1 . Then, we have

.E[ v K 2
] ≤(1 − γ μ)K E[ F (z0 ) 2 ].

Proof We start the proof with an update for .v k :

. vk 2
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
+ 2Fik (zk ) − Fik (zk−1 ), v k−1 .

Next, we use an update for .zk and make a small rearrangement

2
. vk 2
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
− Fik (zk ) − Fik (zk−1 ), zk − zk−1
γ
1
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
− Fik (zk ) − Fik (zk−1 ), zk − zk−1
γ
1
− Fik (zk ) − Fik (zk−1 ), zk − zk−1 .
γ
SARAH for Stochastic Cocoercive Variational Inequalities 51

Taking the full mathematical expectation, we obtain

E[ v k
.
2
] =E[ v k−1 2 ] + E[ Fik (zk ) − Fik (zk−1 ) 2 ]
1
− E[Fik (zk ) − Fik (zk−1 ), zk − zk−1 ]
γ
1
− E[Fik (zk ) − Fik (zk−1 ), zk − zk−1 ].
γ

Independence of the .ik generation gives

. E[ v k 2
] =E[ v k−1 2 ] + E[ Fik (zk ) − Fik (zk−1 ) 2 ]
1
− E[Fik (zk ) − Fik (zk−1 ), zk − zk−1 ]
γ
1
− E[Eik [Fik (zk ) − Fik (zk−1 )], zk − zk−1 ]
γ
=E[ v k−1 2 ] + E[ Fik (zk ) − Fik (zk−1 ) 2 ]
1
− E[Fik (zk ) − Fik (zk−1 ), zk − zk−1 ]
γ
1
− E[F (zk ) − F (zk−1 ), zk − zk−1 ].
γ

With Assumptions 1 and 2, we get

E[ v k
.
2
] ≤E[ v k−1 2 ] + E[ Fik (zk ) − Fik (zk−1 ) 2 ]
1
− E[ Fik (zk ) − Fik (zk−1 ) 2 ]
γ
μ
− E[ zk − zk−1 2 ]
γ

γ − 1
=(1 − γ μ)E[ v k−1 2
]+ E[ Fik (zk ) − Fik (zk−1 ) 2 ].
γ

In the last step we substitute .zk−1 − zk = γ v k . The choice of .0 < γ ≤ 1

gives

E[ v k
.
2
] ≤(1 − γ μ)E[ v k−1 2 ].

Running recursion and using .v 0 = F (z0 ), we finish the proof.

The following lemma gives how different .v K and .F (zK ) are in the inner loop of
Algorithm 1.
52 A. Beznosikov and A. Gasnikov

Lemma 2 Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1). Then,
we have
γ
E[ F (zK ) − v K
.
2
]≤ E[ F (z0 ) 2 ].
2 − γ

Proof Let us consider the following chain of reasoning:

E[ F (zk ) − v k
.
2
] =E[ [F (zk−1 ) − v k−1 ] + [F (zk ) − F (zk−1 )] − [v k − v k−1 ] 2 ]
=E[ F (zk−1 ) − v k−1 2 ] + E[ F (zk ) − F (zk−1 ) 2 ]
+ E[ v k − v k−1 2 ]
+ 2E[F (zk−1 ) − v k−1 , F (zk ) − F (zk−1 )]
− 2E[F (zk−1 ) − v k−1 , v k − v k−1 ]
− 2E[F (zk ) − F (zk−1 ), v k − v k−1 ]
=E[ F (zk−1 ) − v k−1 2 ] + E[ F (zk ) − F (zk−1 ) 2 ]
+ E[ v k − v k−1 2 ]
+ 2E[F (zk−1 ) − v k−1 , F (zk ) − F (zk−1 )]
− 2E[F (zk−1 ) − v k−1 , Eik [v k − v k−1 ]]
− 2E[F (zk ) − F (zk−1 ), Eik [v k − v k−1 ]]
=E[ F (zk−1 ) − v k−1 2 ] − E[ F (zk ) − F (zk−1 ) 2 ]
+ E[ v k − v k−1 2 ]
≤E[ F (zk−1 ) − v k−1 2 ] + E[ v k − v k−1 2 ].

Here we also use that

Eik [v k − v k−1 ] = Eik [Fik (zk ) − Fik (zk−1 )] = F (zk ) − F (zk−1 ).

Running recursion and using .v 0 = F (z0 ), we have

K
E[ F (zK ) − v K
.
2
]≤ E[ v k − v k−1 2 ]. (6)
k=1

In the same way as in Lemma 1, we can derive

. vk 2
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
+ 2Fik (zk ) − Fik (zk−1 ), v k−1
2
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
− Fik (zk ) − Fik (zk−1 ), zk − zk−1
γ
SARAH for Stochastic Cocoercive Variational Inequalities 53

2
≤ v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
− Fi (zk ) − Fik (zk−1 ) 2
γ k

γ − 2
= v k−1 2
+ Fik (zk ) − Fik (zk−1 ) 2
γ

γ − 2
= v k−1 2
+ v k − v k−1 2 .
γ

After a small rewriting and with the full expectation, we get

γ
E[ v k − v k−1 2 ] ≤
. E[ v k−1 2
− vk 2
].
2 − γ

By substituting this into the expression (6) and using .v 0 = F (z0 ), we finish the
proof.
Let us combine Lemmas 1 and 2 into the main theorem of this paper.
Theorem 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algo-
rithm 1) with .γ = 9
2
and .K = 10
μ . Then, we have

1
E[ F (z̃s ) 2 ] ≤
. E[ F (z̃s−1 ) 2 ].
2
Proof We start from

E[ F (zK ) 2 ] ≤ 2E[ F (zK ) − v K

.
2
] + 2[E v K 2
].

Applying Lemma 1 and 2, we have

2γ
E[ F (zK ) 2 ] ≤
. + 2(1 − γ μ)K E[ F (z0 ) 2 ]
2 − γ

2γ
≤ + 2 exp(−γ μK) E[ F (z0 ) 2 ].
2 − γ

Here we also use that .γ μ ∈ (0; 1) (for .γ ≤ 2

9 ) and then .(1 − γ μ) ≤ exp(−γ μ).
The substitution .γ and K gives

1
E[ F (zK ) 2 ] ≤
. E[ F (z0 ) 2 ].
2

We know that .z0 = z̃s−1 and .zK = z̃s and have

1
E[ F (z̃s ) 2 ] ≤
. E[ F (z̃s−1 ) 2 ].
2

54 A. Beznosikov and A. Gasnikov

Since we need to find a point z such that .F (z) ≈ F (z∗ ) = 0, we can easily get
an estimate on the oracle complexity (number of .Fi calls) to achieve precision .ε.
Corollary 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algo-
rithm 1) with .γ = 9
2
μ . Then, to achieve .ε-solution (.E F (z̃ )
and .K = 10 S 2 ∼ ε 2 ),

we need

F (z0 ) 2
.O n+ log2 oracle calls.
μ ε2

Proof From Theorem 1 we need the following number of outer iterations:

2
F (z0 )
S = O log2
. .
ε2

At each outer iteration we compute the full operator one time, and at the remaining
K − 1 iterations we call the single operator .Fi two times per one inner iteration.
.

Then, the total number of oracle calls is

2
F (z0 )
S × (2 × (K − 1) + n) = O
. n+ log2 .
μ ε2

Note that the obtained oracle complexity coincides with the similar complexity
for SVRG from [6]. It is interesting to see how these methods behave in practice.

4 Experiments

The aim of our experiments is to compare the performance of different methods for
stochastic finite-sum cocoercive variational inequalities. In particular, we use SGD
from [26], SVRG from [6] and SARAH. We conduct our experiments on a finite-
sum bilinear saddle point problem:
n
1 λ λ
.g(x, y) = gi (x, y) = x Ai y + ai x + bi y + x 2
− y 2
, (7)
n 2 2
i=1

where .Ai ∈ Rd×d , .ai , bi ∈ Rd . This problem is .λ-strongly

convex–strongly concave
and, moreover, L-smooth with .L = A 2 for .A = n1 ni=1 Ai . We take .n = 10,
.d = 100 and generate matrix A and vectors .ai , bi randomly, .λ = 1. For this problem
A 2
the cocoercivity constant . = λ 2 . The steps of the methods are selected for best
convergence. For SVRG and SARAH the number of iterations for the inner loops is
SARAH for Stochastic Cocoercive Variational Inequalities 55

(a) (b) (c)

Fig. 1 Bilinear problem (7): Comparison of state-of-the-art SGD-based methods for stochastic
cocoercive VIs. (a) Small .. (b) Medium .. (c) Large .

taken as . λ . We run three experiment setups: with small . ≈ 102 , medium . ≈ 103
and big . ≈ 104 .
See Fig. 1 for the results. We see that SARAH converges better than SVRG, and
SGD converges much slower.

Acknowledgments The authors would like to congratulate Boris Mirkin on his jubilee and wish
him good health and new scientific advances.
The work of A. Beznosikov was supported by the strategic academic leadership program ‘Pri-
ority 2030’ (Agreement 075-02-2021-1316 30.09.2021). The work of A. Gasnikov was supported
by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye), No.
075-00337-20-03, project No. 0714-2020-0005.

References

1. Alacaoglu, A., Malitsky, Y.: Stochastic variance reduction for variational inequality methods.
Preprint. arXiv:2102.08352 (2021)
2. Alacaoglu, A., Malitsky, Y., Cevher, V.: Forward-reflected-backward method with variance
reduction. Comput. Optim. Appl. 80 (2021). https://doi.org/10.1007/s10589-021-00305-3
3. Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex
objectives. In: International Conference on Machine Learning, pp. 1080–1089. PMLR (2016)
4. Beznosikov, A., Samokhin, V., Gasnikov, A.: Distributed saddle-point problems: lower bounds,
optimal and robust algorithms. Preprint. arXiv:2010.13112 (2020)
5. Beznosikov, A., Gasnikov, A., Zainulina, K., Maslovskiy, A., Pasechnyuk, D.: A unified
analysis of variational inequality methods: variance reduction, sampling, quantization and
coordinate descent. Preprint. arXiv:2201.12206 (2022)
6. Beznosikov, A., Gorbunov, E., Berard, H., Loizou, N.: Stochastic gradient descent-ascent:
unified theory and new efficient methods. Preprint. arXiv:2202.07262 (2022)
7. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning.
Siam Rev. 60(2), 223–311 (2018)
8. Chavdarova, T., Gidel, G., Fleuret, F., Lacoste-Julien, S.: Reducing noise in gan training with
variance reduced extragradient. 32, 393–403 (2019)
9. Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. Preprint.
arXiv:1905.10018 (2019)
10. Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: a fast incremental gradient method with support
for non-strongly convex composite objectives. In: Advances in Neural Information Processing
Systems, pp. 1646–1654 (2014)
56 A. Beznosikov and A. Gasnikov

11. Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via
stochastic path integrated differential estimator. Preprint. arXiv:1807.01695 (2018)
12. Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality
perspective on generative adversarial networks. Preprint. arXiv:1802.10551 (2018)
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
14. Gorbunov, E., Hanzely, F., Richtárik, P.: A unified theory of SGD: Variance reduction,
sampling, quantization and coordinate descent. In: International Conference on Artificial
Intelligence and Statistics, pp. 680–690. PMLR (2020)
15. Gorbunov, E., Berard, H., Gidel, G., Loizou, N.: Stochastic extragradient: general analysis and
improved rates. In: International Conference on Artificial Intelligence and Statistics, pp. 7865–
7901. PMLR (2022)
16. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: On the convergence of single-call
stochastic extra-gradient methods. In: Advances in Neural Information Processing Systems,
vol. 32, pp. 6938–6948. Curran Associates, Inc. (2019)
17. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: Explore aggressively, update conser-
vatively: Stochastic extragradient methods with variable stepsize scaling. 33, 16223–16234
(2020)
18. Hu, W., Li, C.J., Lian, X., Liu, J., Yuan, H.: Efficient smooth non-convex stochastic
compositional optimization via stochastic recursive gradient descent. In: Advances in Neural
Information Processing Systems. vol. 32, pp. 6929–6937. Curran Associates, Inc. (2019)
19. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance
reduction. In: Advances in Neural Information Processing Systems, vol. 26, pp. 315–323
(2013)
20. Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic
mirror-prox algorithm. Stoch. Syst. 1(1), 17–58 (2011)
21. Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. Preprint.
arXiv:2002.03329 (2020)
22. Kovalev, D., Beznosikov, A., Borodich, E., Gasnikov, A., Scutari, G.: Optimal gradient sliding
and its application to distributed optimization under similarity. Preprint. arXiv:2205.15136
(2022)
23. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
24. Li, Z., Bao, H., Zhang, X., Richtárik, P.: Page: a simple and optimal probabilistic gradient
estimator for nonconvex optimization. In: International Conference on Machine Learning, pp.
6286–6295. PMLR (2021)
25. Liu, X., Cheng, H., He, P., Chen, W., Wang, Y., Poon, H., Gao, J.: Adversarial training for large
neural language models. Preprint. arXiv:2004.08994 (2020)
26. Loizou, N., Berard, H., Gidel, G., Mitliagkas, I., Lacoste-Julien, S.: Stochastic gradient
descent-ascent and consensus optimization for smooth games: convergence analysis under
expected co-coercivity. In: Advances in Neural Information Processing Systems, vol. 34, pp.
19095–19108 (2021)
27. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
28. Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., Malitsky, Y.: Revisiting stochastic
extragradient. In: International Conference on Artificial Intelligence and Statistics, pp. 4573–
4582. PMLR (2020)
29. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine
learning problems using stochastic recursive gradient. In: International Conference on Machine
Learning, pp. 2613–2621. PMLR (2017)
30. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: Stochastic recursive gradient algorithm for
nonconvex optimization. Preprint. arXiv:1705.07261 (2017)
31. Nguyen, L.M., Nguyen, P.H., Richtárik, P., Scheinberg, K., Takác, M., van Dijk, M.: New
convergence aspects of stochastic gradient algorithms. J. Mach. Learn. Res. 20(176), 1–49
(2019), http://jmlr.org/papers/v20/18-759.html
SARAH for Stochastic Cocoercive Variational Inequalities 57

32. Nguyen, L.M., Scheinberg, K., Takáč, M.: Inexact SARAH algorithm for stochastic optimiza-
tion. Optim. Methods Softw. 36(1), 237–258 (2021)
33. Palaniappan, B., Bach, F.: Stochastic variance reduction methods for saddle-point problems.
In: Advances in Neural Information Processing Systems, pp. 1416–1424 (2016)
34. Qian, X., Qu, Z., Richtárik, P.: Saga with arbitrary sampling. In: International Conference on
Machine Learning, pp. 5190–5199. PMLR (2019)
35. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407
(1951)
36. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average
gradient. Math. Program. 162(1–2), 83–112 (2017)
37. Shalev-Shwartz, S., Ben-David, S.: Understanding machine learning: from theory to algo-
rithms. Cambridge University Press (2014)
38. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient
solver for SVM. Math. Program. 127(1), 3–30 (2011)
39. Yang, Z., Chen, Z., Wang, C.: Accelerating mini-batch SARAH by step size rules. Inf. Sci.
558, 157–173 (2021)
40. Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T., Liu, J.: Freelb: enhanced adversarial training
for natural language understanding. Preprint. arXiv:1909.11764 (2019)
Dimensionality Reduction Using
Pseudo-Boolean Polynomials for Cluster
Analysis

Tendai Mapungwana Chikake and Boris Goldengorin

1 Introduction

In the fields of data visualization and cluster analysis, dimensionality reduction

mechanisms play pivotal roles in providing better understanding of relations and
clusters in input data. Dimensionality reduction techniques work by transforming
data from high-dimensional spaces into low-dimensional spaces such that the low-
dimensional representations retain meaningful properties of the original data, ideally
close to their intrinsic dimensions [1].
Real-world data is often available in high-dimensional spaces which are usually
cognitively and computationally hard to process [2]. Human observers as well as
presentation mediums readily available, like 2-dimensional papers or screens, are
presented with representational challenges whenever data is available in higher
than 3-dimensional spaces and consequently the identity of classes, useful or noisy
features becomes harder to discover [2].
Data scientists often spend enormous amounts of time and effort digging for
relevant features that determine classes or those features that bring useless noise in

T. M. Chikake ()
Department of Discrete Mathematics, Moscow Institute of Physics and Technology, Moscow,
Russian Federation
e-mail: [email protected]
B. Goldengorin
Department of Mathematics, New Uzbekistan University, Tashkent, Uzbekistan
The Scientific and Educational Mathematical Center «Sofia Kovalevskaya Northwestern Center
for Mathematical Research», Pskov State University, Pskov, Russia
Department of Discrete Mathematics, Phystech School of Applied Mathematics and Informatics,
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow region, Russia
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 59

datasets. Automated systems like artificial neural networks can be used in the feature
selection processes [3], but often require large amounts of data and challenges like
feature superposition [4], underfitting, overfitting, and interpretation concerns arise
[5].
In fields where large numbers of observations and/or large numbers of variables
exist such as signal processing, computer vision, speech recognition, neuroinfor-
matics, and bioinformatics, usage of dimension reduction techniques is crucial
[6]. Dimensionality reduction simplifies cluster analysis tasks for both human and
machine processors [6].
In this work, we observe that our powerful dimensionality reduction method can
assist in reducing the abuse of statistical methods and/or artificial neural networks
in tasks that can be solved in combinatorial steps. We show this by qualifying
our dimensionality reduction method, followed by linear clustering of reduced
samples. This advantage is available because our reduction method has invariability
properties, and it is intuitively easy to interpret. The problems of invariability and
interpretability are among the top problems of currently available dimensionality
reduction methods [7]. Dimensionality reduction tools that lack interpretability
and/or invariability can be disfavoured in critical tasks such as clustering models
that input medical tests/measurements to predict a diagnosis. Our work is directed
to solving such challenges.
Cluster analysis or clustering is the task of grouping a set of objects in such
a way that objects in the same group (called a cluster) are more similar (in some
sense) to each other than to those in other clusters [8]. Clustering is a major task of
exploratory data analysis, and a common technique for statistical data analysis, used
in many fields, including pattern recognition, image analysis, information retrieval,
bioinformatics, data compression, computer graphics and machine learning [8].
Abuse of statistical methods and/or artificial neural networks often arise in cases
where data is minimal. This abuse may result in overfitted, irreproducible or hard-
to-interpret solutions which may be undesirable to use in some critical tasks.
The invariant manipulation based on formulation of pseudo-Boolean polyno-
mials presented in this work, can enable cluster analysts to extract simple rules
of associations in data without the need for machine learning. The formulation of
pseudo-Boolean polynomials is very simple, computationally efficient, invariant to
ordering, and easy to explain and reproduce.
Our overall contributions in this paper are:
1. Qualifying the usage of the reduction property of penalty-based pseudo-Boolean
polynomials formulation for dimensionality reduction of multidimensional data
where it is feasible.
2. Reducing overdependence on data-driven approaches in solving problems that
can be solved with combinatorial steps.
Dimensionality reduction using pseudo-Boolean polynomials formulation,
revolves around the manipulation of the reduction and equivalence properties
of penalty-based pseudo-Boolean polynomials [9].
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 61

We present our results on classical Wisconsin Diagnostic Breast Cancer (WDBC)

[10] and Iris Flower datasets [11], which have too few samples, such that the usage
of a data-driven methods like artificial neural networks for clustering would result
in abuse.
The Iris Flower dataset [11] has samples of size .1 × 4 and present challenges of
identifying clusters by plotting on a Cartesian plane for the analyst while the Wis-
consin Diagnostic Breast Cancer (WDBC) [10] dataset, has samples of size .1 × 30
that result in 30-Dimensionality representation that would be incomprehensible for
Cartesian plot based analysis.
Our proposed method is limited to data whose samples can be represented as
cost matrices where each cell represents a cost relationship of its respective column
and row. In the experiment sections, we show how these complex datasets can be
reduced to .2 × 1 and .3 × 1 dimensionality which are easily analysed on a Cartesian
plane and Cartesian space respectively. Simple linear demarcations are then used to
qualify the label of a given sample without any machine learning process or non-
linear alteration of data.

2 Related Work

Principal Component Analysis (PCA) and the T-distribute Stochastic Neighbour

Embedding (t-SNE) are arguably the most popular dimensionality reduction meth-
ods in cluster analysis tasks. The choice of usage of either, is usually case based.
The T-distribute Stochastic Neighbour Embedding is often used for visualizing high
dimensional data. It works by converting similarities between data points to joint
probabilities and tries to minimize the Kullback-Leibler divergence between the
joint probabilities of the low-dimensional embedding and the high-dimensional data
[12]. T-SNE has a cost function that is not convex, i.e., with different initializations
different reductions may result [12]. The non-convex nature of the cost function in
the t-SNE tool is a major drawback in comparison with the Principal Component
Analysis method as well as the pseudo-Boolean polynomials-based reduction that
we qualify in this paper.
Dimensionality reduction by pseudo-Boolean polynomials formulation ensures
unique reduced Hammer-Beresnev polynomials, regardless of possible difference
in ordering of input matrices [9]. Consequently, our method is guaranteed to output
the same reductions, regardless of difference in initializations or input orderings.
Principal Component Analysis (PCA), is an orthogonal linear transformation that
transforms data to a new coordinate system such that the greatest variance by some
scalar projection of the data comes to lie on the first coordinate (called the first
principal component), the second-greatest variance on the second coordinate, and so
on [13]. The major drawback of PCA is that it changes the distances involved in our
data because it reduces dimensions in a way that preserves large pairwise distance
better than small pairwise distance [14]. These changes can be very sensitive to our
algorithms, especially when working with Euclidean distance-based algorithms.
62 T. M. Chikake and B. Goldengorin

Modifications to the Principal Component Analysis (PCA) method exists like

the kernel principal component analysis (kernel PCA) [1] and Graph-based kernel
PCA [15], that can be employed in a nonlinear manner but the distance and data
dependence concerns prevail.
Other techniques include Linear discriminant analysis (LDA) [16], Generalized
discriminant analysis (GDA) [17], and Uniform manifold approximation and pro-
jection (UMAP) [18]. All of these techniques, raise all or some of the concerns
outlined above as they are largely dependent on data distribution to operate.
The pseudo-Boolean polynomials approach presented in this work, operates in
isolation for each individual sample, such that no distribution biases are introduced
to any given sample. The combinatorial operations are constant and invariant to
ordering across all samples.

3 Methods

In mathematics and optimization, a pseudo-Boolean function is a function of the

form .f : Bn → R, where .B = {0, 1} is a Boolean domain and .n is a non-negative
integer called the degree of the function [19].
We utilise a penalty based formulation of pseudo-Boolean polynomials described
in [20] for data aggregation which provides us an invariant dimensionality reduction
property that we present in this work.
Goldengorin et al. [9] highlights fundamental reduction properties of pseudo-
Boolean polynomials which guarantee the maintenance of the underlying initial
information while reducing the size of the problem.
We extend the objective goal in [20] by describing our problem as a problem
of minimising the cost of describing a given sample and thereby reducing the
dimensionality of the particular sample.
Given a sample where observables (e.g. physical quantities, types of measure-
ments, etc.) .I = {1, 2, ..., m} (ordered by their correlation strengths to labels), their
physical measurements .J = {1, 2, ..., n}, and .p the maximum dimensionality size
desired, to minimise the cost of describing the given sample, our task is to find a set
.S ⊆ I with .|S| = p minimising the prespecified objective function.

We define the instance of the problem by an .m × n matrix .C = [cij ] of costs

(distances, bandwidth, time, (dis)similarities, (in)significance, etc.), .j ∈ J and
.i ∈ I , and the goal is to find a set .S ⊆ I with .|S| = p, such that we minimise total

cost

.fc (S) = min{cij |i ∈ S}, (1)
j ∈J

with the assumption that entries of .C are non-negative and finite [20].
According to AlBdaiwi et al. [20], the objective function .fc (S) of this kind
of problem can be formulated in terms of pseudo-Boolean polynomials and from
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 63

[19] all pseudo-Boolean polynomials can be uniquely represented as multilinear

polynomials of the form

f (y) =
. cS yi (2)
S⊆I i∈S

Pseudo-Boolean polynomials formulation is achievable in polynomial time and

allows us to achieve
compact representations of relativelylarge problems [20].
From (2), . i∈S yi is the term of the monomial .cS i∈S yi . Monomials with
the same term are called similar monomials [9], and they can be added together
in a process called reduction [9] which is central to our dimensionality reduction
solution as it allows us to reduce the number of columns in the initial cost matrix.
In addition to compacting large problems, there exists different instances that
have similar (reduced) Hammer–Beresnev polynomials, mainly because similar
monomials can be aggregated and disaggregated [9].
These properties are essential in cluster analysis because samples which might
look dissimilar in higher dimensional space, can actually converge to similarity in
their reduced pseudo-Boolean polynomials form.
This work seeks to exploit these fundamental properties: representational reduc-
tion and equivalence as dimensionality reduction and clustering mechanisms respec-
tively.
By treating measurable attributes of multidimensional data, like physical mea-
surements, pixel positioning and intensity distribution in image data, and other
describable/measurable attributes as information costs of describing samples, we
can formulate for each sample, a cost matrix .C which we can manipulate and reduce,
in an ordering invariant manner, by pseudo-Boolean polynomials formulation and
achieve lower dimension representation of each sample independent of any other
samples in the dataset.
The equivalence [9] property and other distance comparisons can then be applied
on the reduced data representation for cluster analysis.
Visualizing a matrix sample of size .1 × n where .n ∈ {1, 2, 3} is easily compre-
hensive by scatter plotting along 1-D, 2-D, and 3-D planes respectively. If classes
are present in the data, we can identify linear or non-linear lines or plane separators
that demarcate boundaries of clusters in the data.
The pseudo-Boolean approach to dimensionality reduction in measured features
for sample clustering is a penalty-based approach that relies on the fact that we
require attributes that positively distinguish underlying classes for each instance to
be sufficiently represented based on their importance to the classifier.
The task seeks to minimize measurements that insignificantly contribute to the
identity of a sample in a specific class.
A sample is described by an .m × n matrix .C = [ci,j ] where .I represents the
measured feature while .J represent the measurement such that columns of the
matrix contain homogenous quantities. In reducing the dimensionality of samples
in the Iris Flower dataset [11] for example, .I represents measured features, sepal,
and petal while .J represent the measurements width and length thereof.
64 T. M. Chikake and B. Goldengorin

We define the decisive insignificance .S of an attribute to classifying a sample

into a specific cluster as

fc (S) =
. min{cij |i ∈ S} (3)
j ∈J

and the dimensionality reduction task is the problem of finding

S ∗ ∈ arg min{fc (S) : ∅ ⊂ S ⊆ I, |S| = p},

. (4)

where .p is the output dimension size, which we however choose to be .|J | such that
no measurement is lost.
By processing our samples into pseudo-Boolean polynomials we achieve the
sample representations with the least possible information costs needed to represent
them.

4 Experimental Setup

4.1 The Iris Flower Dataset

Table 1 shows the Iris Flower dataset [11], a classical and popular dataset in the
machine learning community. The task on this dataset is to classify Iris plants into
three species (Iris setosa, Iris versicolor, and Iris virginica) using the lengths and
widths measurements of their petals and sepals. The dataset contains 150 samples.
Our method requires that data is structured as matrices, where columns are
measurements and rows are the features measured. For this dimensionality reduction
task, we first reshape the structure of all instances from .1 × 4 sized instances to
.2 × 2 sized instances, where rows represent the measurement type (Sepal, Petal)

and the columns represent the measurements (Length and Width) of the features
collected from 3 different species:

Table 1 Iris dataset

ID Sepal length (cm) Sepal width (cm) Petal length (cm) Petal width (cm) Target
1 5.4 3.0 4.5 1.5 Versicolor
2 7.7 2.8 6.7 2.0 Virginica
3 5.2 3.4 1.4 0.2 Setosa
4 4.8 3.4 1.9 0.2 Setosa
. . . . . .
. . . . . .
. . . . . .
150 6.6 3.0 4.4 1.4 Versicolor
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 65

Table 2 Transformed Length (cm) Width (cm)

instance
Sepal 5.4 3.0
Petal 4.5 1.5

Fig. 1 Scatter on sepal length-width

We transform the samples into costs matrices by making rows of measured

features (Sepal, Petal) and columns of measurements (Length, Width) resulting in
.2 × 2 costs matrices as shown in Table 2.

One cluster in the Iris Flower dataset, Iris-Setosa is linearly separable from others
while Iris-virginica and Iris-versicolour are not linearly separable between each
other as shown in Figs. 1 and 2.
Figure 1 shows the scatter plot on lengths and widths of sepal measurements
of the samples while Fig. 2 shows the scatter plot on lengths and widths of petal
measurements of the samples.
Applying the pseudo-Boolean polynomials reduction program, we reduce all
samples to .2 × 1 matrices. After applying combinations of like terms and dropping
of zero columns, the pseudo-Boolean polynomials of the instance in Table 2 reduces
to

6.0
.
2.4y2

Applying the reduction method on every sample, results in a list of samples

reduced to .2 × 1 matrices of the form .a + by2 which can be plotted and classified
on a Cartesian plane by simple boundary lines.
Our results on this dataset also expose the overfitting flaw which may be
overlooked in works that abuse artificial neural networks such as the reported 100%
clustering accuracy reported in [21].
66 T. M. Chikake and B. Goldengorin

Fig. 2 Scatter on petal length-width

4.2 The Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

The Wisconsin Diagnostic Breast Cancer (WDBC) [10] dataset, is another classical
and popular dataset in the machine learning community. The dataset has 569
instances of 30 real-valued features that describe characteristics of the cell nuclei
present in digitized images extracted by a fine needle aspirate (FNA) on a breast
mass [22].
Ten real-valued features were computed for each cell nucleus:
1. radius (mean of distances from centre to points on the perimeter)
2. texture (standard deviation of greyscale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (.perimeter 2 /area − 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension (“coastline approximation” .− 1)
The mean, standard error (se), and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30 features
[22].
The task on this dataset is to predict whether a breast cancer diagnosis is benign
or malignant based on these features.
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 67

The best known predictive accuracy (97.5%) was obtained by using a separating
plane in the 3-D space of Worst Area, Worst Smoothness and Mean Texture features
using repeated tenfold cross-validations [22], and this classifier has correctly
diagnosed 176 consecutive new patients as of November 1995 [10].
The separating plane was obtained using Multisurface Method-Tree (MSM-T)
[23], a classification method which uses linear programming to construct a decision
tree where relevant features are selected using an exhaustive search in the space of
1–4 features and 1–3 separating planes [23].
Samples of size .3 × 10, result in 30-Dimensionality representations, that are hard
to visualise on a Cartesian plot and consequently hard for the analyst to identify the
features that are useful in making an accurate diagnosis.
To qualify our dimensionality reduction tool, we input each sample as a
.3 × n; n ∈ 1, 2, 3, . . . , 10 matrix where rows .J represent the measurement type

(Mean, Standard error (se), Worst) and the columns .I the measured features.
Applying the pseudo-Boolean polynomials formulation, we reduce all samples to
.3 × 1 matrices which can be plotted and classified on a Cartesian space by a simple

boundary plane. Since our formulation runs in polynomial time, we can iterate
all possible arrangements of measured features and discover the set of measured
features which has the best fitting plane that separates samples into benign or
malignant.

5 Results and Discussion

5.1 The Iris Flower Dataset

Figure 3 illustrates the identification of cluster boundaries by simple identification

of lines that best separate the aggregated samples on a Cartesian plane. Plotting the
coefficients of the reduced instances, where the terms 1 and y2 are abscissa and
ordinate respectively, allows us to visualize properly the possible clusters from the
data.
The most important thing to note is that this reduction correctly represents the
original instances with less information cost, and allows us to discover a pair of
straight lines that separate the respective clusters.
All instances that lie in the region .y ≤ x4 + 2, are classified safely as Iris-setosa,
20 + 5, separates Iris-versicolor from Iris-virginica.
while the line .y = 13x
It is also important to note that, some instances which seemed distinct of each
other, actually have similar reduced pseudo-Boolean polynomials form. e.g.

5.4 3.4
.
1.7 0.2
68 T. M. Chikake and B. Goldengorin

Fig. 3 Cartesian plot on resultant .2 × 1 dimensions after reduction by pseudo-Boolean polynomi-

als formulation and the best boundary lines

and

5.1 3.7
.
1.5 0.4

all reduce to

1.9
.
6.9y2

and in perfect confirmation of the equivalence condition; the instances also lie in
the same cluster.
Looking at the single outlier,

5.1 3.7
.
1.5 0.4

which was reduced to

2.0
.
6.7y2

and classified as Iris-virginica instead of Iris-versicolor we observe the faulty nature

of learning-based cluster methods such as support-vector machines and [21]’s X-
Boosted artificial neural network, which output 100% cluster accuracy on some test
runs. The original measurements of this outlier, perfectly fit the cluster Iris-virginica
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 69

as there are samples of Iris-virginica that have very little difference in measured
values to this outlier, while its values are also significantly distinct from the other
Iris-versicolor samples. The incorrectly clustered instance might be attributed to
misclassification by the persons who labelled the dataset, or perhaps just a naturally
occurring outlier.
This finding exposes, the overfitting concerns that arise from learning-based
methods, like the Space Vector Machine (SVM) and [21]’s X-Boosted artificial
neural network, that would report 100% accuracy in some tests. Additionally,
changing the train/test data, result in varied accuracies when these methods are
used, thereby losing the invariant and reproducibility attributes that our method
guarantees.

5.2 The Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Plotting the coefficients of the reduced instances just as in the previous dataset,
allows us to visualize properly the possible clusters from the data.
The reduction correctly represents the original instances with less information
cost, and allows us to discover a combination of features with the best plane that
separates the respective clusters.
Of all the combinations of features explored, the combination with the best
separating plane (95.4% accuracy) consisted of
1. radius
2. texture
3. perimeter
4. smoothness
5. compactness
6. concavity
7. symmetry
8. fractal dimension
excluding area and concave points features (Fig. 4).
Instances with the shortlisted features are separable by a plane .z = .85x −2y −0.4
at an accuracy of 95.4%.
Although, falling short of the accuracy (97.5%) reported in [10], our method
manages to present the dimensionality reduction capacity of a simple and invariant
method that is solely based on manipulation of orderings.
As shown in the experiments results reported in this paper, our method can be
used for unsupervised clustering of multidimensional data as well as in feature
selection processes in cluster analysis. Our method allows us to have a better
understanding of how multidimensional features contribute to classification of
samples in an invariant and explainable manner and in some cases achieve unbiased
and unsupervised clustering in cluster analysis processes.
70 T. M. Chikake and B. Goldengorin

Fig. 4 Cartesian plot on resultant .3 × 1 dimensions after reduction by pseudo-Boolean polynomi-

als formulation and the best separating plane

6 Conclusion

In this paper, we managed to showcase a combinatorial method for dimensionality

reduction for cluster analysis, based on the formulation and reduction of pseudo-
Boolean polynomials.
We tested our method on simple datasets, and managed to show that we can
classify data samples with competitive accuracies by simple and linear data slices
(lines and planes). Our proposed solution is invariant and interpretable while
avoiding biases that may be involved when we use statistical methods because each
sample is reduced in an independent manner, solely based on its own description.
Dimensionality Reduction Using Pseudo-Boolean Polynomials for Cluster Analysis 71

It can be noted that dimension reduction using pseudo-Boolean polynomials

on high-level features is a powerful tool for lossless dimension reduction and has
potential of accelerating low memory representation of complex data that empowers
other complex tasks like interpretable unsupervised clustering in computer vision,
bioinformatics, natural language processing and machine learning.
We expect to reproduce even better state-of-the-art accuracies on other data
science tasks, when we apply more powerful tools like decision trees and artificial
neural networks on instances in their reduced pseudo-Boolean polynomials forms.
The biggest takeaway from our findings is the invariability and interpretability
nature of the dimension reduction process using pseudo-Boolean polynomials
formulation.

Acknowledgments Tendai Mapungwana Chikake and Boris Goldengorin’s research was sup-
ported by Russian Science Foundation project No. 21-71-30005.
Boris Goldengorin acknowledges Scientific and Educational Mathematical Center “Sofia
Kovalevskaya Northwestern Center for Mathematical Research” for financial support of the present
study (agreement No 075-02-2023-937, 16.02.2023).

References

1. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a Kernel eigenvalue
problem. Neural Comput. 10(5), 1299–1319 (1998)
2. Johnstone, I., Titterington, D.: Statistical challenges of high-dimensional data. Philos. Trans.
A Math. Phys. Eng. Sci. 367, 4237–53 (2009)
3. Notley, S., Magdon-Ismail, M.: Examining the use of neural networks for feature extraction:
a comparative analysis using deep learning, support vector machines, and K-nearest neighbor
classifiers. arXiv preprint arXiv:1805.02294 (2018)
4. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds,
Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D.,
Wattenberg, M., Olah, C.: Toy Models of Superposition. arXiv e-prints arXiv:2209.10652
(2022). https://doi.org/10.48550/arXiv.2209.10652
5. Burnham, K.P., Anderson, D.R., Burnham, K.P.: Model Selection and Multimodel Inference:
A Practical Information-Theoretic Approach, 2nd edn. Springer, New York (2002)
6. Kopp, W., Akalin, A., Ohler, U.: Simultaneous dimensionality reduction and integration for
single-cell ATAC-seq data using deep learning. Nat. Mach. Intell. 4(2), 162–168 (2022)
7. Sarveniazi, A.: An actual survey of dimensionality reduction. Am. J. Comput. Math. 04(02),
55–72 (2014)
8. Sinharay, S.: An overview of statistics in education. In: Peterson, P., Baker, E., McGaw, B.
(eds.) International Encyclopedia of Education, 3rd edn., pp. 1–11. Elsevier, Oxford (2010)
9. Goldengorin, B., Krushinsky, D., Pardalos, P.M.: Cell Formation in Industrial Engineering,
Springer Optimization and Its Applications, vol. 79. Springer, New York (2013)
10. Street, W.N., Wolberg, W.H., Mangasarian, O.L.: Nuclear feature extraction for breast tumor
diagnosis. In: Acharya, R.S., Goldgof, D.B. (eds.) Biomedical Image Processing and Biomed-
ical Visualization, vol. 1905, pp. 861–870. International Society for Optics; Photonics; SPIE
(1993)
11. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7(2),
179–188 (1936)
72 T. M. Chikake and B. Goldengorin

12. Maaten, L.J.P.v.d., Hinton, G.E.: Visualizing high-dimensional data using t-SNE. J. Mach.
Learn. Res. 9(Nov), 2579–2605 (2008)
13. Jolliffe, I.: Principal Component Analysis. Springer (2002)
14. Jiang, H., Eskridge, K.M.: Bias in principal components analysis due to correlated observa-
tions. In: Conference on Applied Statistics in Agriculture (2000)
15. Bengio, Y., Monperrus, M., Larochelle, H.: Nonlocal estimation of manifold structure. Neural
Comput. 18(10), 2509–2528 (2006)
16. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in
Probability and Mathematical Statistics. Wiley, New York (1992)
17. Singh, S., Silakari, S.: Generalized discriminant analysis algorithm for feature reduction in
cyber-attack detection system. Int. J. Comput. Sci. Inform. Secur. 6(1), 173–180 (2009)
18. McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection
for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
19. Boros, E., Hammer, P.L.: Pseudo-boolean optimization. Discrete Appl. Math. 123(1–3), 155–
225 (2002)
20. AlBdaiwi, B.F., Ghosh, D., Goldengorin, B.: Data aggregation for p-median problems. J.
Comb. Optim. 21(3), 348–363 (2011)
21. Sarkar, T.: Xbnet: An extremely boosted neural network. Intell. Syst. Appl. 15, 200097
(2022). https://doi.org/10.1016/j.iswa.2022.200097. https://www.sciencedirect.com/science/
article/pii/S2667305322000370
22. William Wolberg, O.M.: Breast Cancer Wisconsin (Diagnostic) (1993). https://doi.org/10.
24432/C5DW2B. https://archive.ics.uci.edu/dataset/17
23. Bennett, K.P.: Decision tree construction via linear programming. Computer Sciences Techni-
cal Report # 1067, University of Wisconsin, 14 pp., January (1992)
Pseudo-Boolean Polynomials Approach to
Edge Detection and Image Segmentation

Tendai Mapungwana Chikake, Boris Goldengorin, and Alexey Samosyuk

1 Introduction

In digital image processing and computer vision, image segmentation is the process
of partitioning a digital image into multiple image segments, also known as image
regions or image objects [1]. The process has close relationship to blob extraction,
which is a specific application of image processing techniques, whose purpose is to
isolate (one or more) objects (aka. regions) in an input image [2].
The goal of segmentation is to simplify and/or change the representation of an
image into something that is more meaningful and easier to analyse [1].
There exist classical and AI based methods for the purpose of segmentation/blob-
extraction. These methods converge into semantic [3], instance [4] and panoptic [5]
segmentation. The underlying techniques of these methods can be classified into
threshold filtering, clustering, differential motion subtraction, histogram, partial-
differential equation solving, graph partitioning, supervised neural-network asso-
ciation and edge detecting methods. Our proposed method lies at the intersection of
edge detection and threshold filtering methods.

T. M. Chikake · A. Samosyuk
Department of Discrete Mathematics, Moscow Institute of Physics and Technology, Moscow,
Russian Federation
e-mail: [email protected]; [email protected]
B. Goldengorin ()
Department of Mathematics, New Uzbekistan University, Tashkent, Uzbekistan
The Scientific and Educational Mathematical Center «Sofia Kovalevskaya Northwestern Center
for Mathematical Research», Pskov State University, Pskov, Russia
Department of Discrete Mathematics, Phystech School of Applied Mathematics and Informatics,
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow region, Russia
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 73

Image segmentation plays a pivotal role, as a preprocessing step in localizing

region of interest in content-based image retrieval, anomaly detection in industrial
imagery, medical imaging, object detection and recognition tasks.
Our proposed method utilise the reduction, equivalence and degree properties
of penalty-based pseudo-Boolean polynomials for purposes of extracting regions
of interest which provide course segmentation that can be extended to image
segmentation. We derive our methodology from [6]’s work on penalty-based
pseudo-boolean formulation for purposes of data aggregation on p-median problems
in the context of image processing.
Our method bases its operation on a combination of threshold filtering and
edge detection by deterministically grouping masks which convey regions of colour
gradient shift.
Localizing regions of interest in a blind manner is a difficult task, often requiring
pattern recognition of big image data to derive useful utility. Our approach avoids
learning of patterns from data and operate in a deterministic manner.
In our method, we partition an input image into small patches which we treat as
information cost matrices. We aggregate these matrices into their smallest possible
pseudo-Boolean polynomials and group together equivalent instances, thereby
delineating spatially contrasting regions in image representations.
We verify our method by applying it to simple images containing primitive
shapes and then scale up to simple natural scene images where we show that the
method can competitively extract segments.
Our proposed method seeks to
• introduce a blind preprocessing step for semantic segmentation
• assist with unsupervised annotation of segmentation datasets
• promote deterministic approaches to computer vision solutions.
The proposed method requires tuning of three parameters to balance performance
and fine-tuned segmentation results.

2 Related Work

Currently, our proposed method acts as a supportive step to a final segmentation

processor. This means that our method is still a complementary processing tool to
complete an image segmentation task.
Complete semantic segmentation requires the clustering of parts of an image
together and proposing an object class while instance segmentation is concerned
with detecting and delineating each distinct object of interest appearing in an
image [7].
Our current solution cannot yet attach object classes but can distinctly delineate
some individual objects in an image. The method achieves this by separating
neighbouring pixels into blob or edge region based on the degree of a pseudo-
Boolean polynomial calculated on patches extracted from the image.
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 75

The resulting masks can be refined into classifications or delineations of distinct

objects by an additional process which we are still working on. Our end goal is to
achieve a solution for complete segmentation.
By the time of writing, popular solutions to instance and semantic segmentations
are mostly based on either Mask R-CNN [8] or the U-Net Convolutional Network
[9]. Both solutions are learning-based methods and require large amounts of labelled
ground truth data. Neural network based methods also require accelerated computer
processors to perform in reasonable time. Our solution avoids learning rules from
data and can be easily run on non-accelerated CPUs.
Since our method is based on classifying edge and blob regions in an image,
the Canny edge detector [10], introduced in 1986 by John F. Canny, is a closely
related technique. The Canny edge detector is an edge detection operator which
uses a multi-stage algorithm to detect a wide range of edges in images. The Canny
method detects edges by first applying a Gaussian filter to smooth the image in order
to remove the noise, followed by the finding of the intensity gradients in the image
[11]. After these steps, the method applies a gradient magnitude threshold filtering
or lower bound cut-off suppression to get rid of spurious response to edge detection
and then apply a double threshold to determine potential edges and conclude its
process by tracking edges by hysteresis [10]. The Canny algorithm is adaptable to
various environments because its parameters allow it to be tailored to recognition
of edges of differing characteristics depending on the particular requirements of
a given implementation [11]. The optimised Canny filter is recursive, and can be
computed in a short, fixed amount of times, but the implementation of the Canny
operator does not give a good approximation of rotational symmetry and therefore
gives a bias towards horizontal and vertical edges [11]. In contrast, our method,
calculate the pseudo-Boolean polynomial in normal and transposed patch matrices
and selects the pseudo-Boolean polynomial form with the highest degree during the
blob/edge classification step, thereby avoiding the edge direction bias.
From an indirect perspective we can look at blind aggregation of possible
distinct objects in image data in terms of blob extraction. Popular solutions using
this paradigm, have underlying usage of either of the popular blob extraction
methods which include Laplacian of Gaussian [12], Difference of Gaussian [13],
or Determinant of a Hessian [14], among other methods.
Laplacian of Gaussian is a blob extraction method which determines the blobs
by using the Laplacian of Gaussian filters [12]. The Laplacian is a 2-D isotropic
measure of the second spatial derivative of an image which highlights regions of
rapid intensity change and is therefore, often used for edge detection [12]. We
often apply the Laplacian after an image smoothening method with something
approximating a Gaussian smoothing filter in order to reduce its sensitivity to noise.
In our method, the Gaussian smoothing filter step is an optionally used when applied
to noisy image instances.
The Difference of Gaussian method determines blobs by using the difference of
two differently sized Gaussian smoothed images and follows generally most of the
concept of the Laplacian of Gaussian [12].
76 T. M. Chikake et al.

Determinant of a Hessian to aggregate regions of possible segmentation is

achieved by determining blobs using the maximum in the matrix of the Hessian
determinant [14].
Generally, blob extraction based methods, propose small and numerous regions
of interest making them hard to extend into spatial segmentation.
On the other hand, our method aggregates equivalent regions, by assigning
zero or pseudo-Boolean polynomials with lower degree as blob regions and edge
otherwise. Regions initially described in different colour distributions in the pixel
array often output high-order pseudo-Boolean polynomials which indicate contour
regions. This property allows us to extract larger and fewer spatial regions of interest
in an image, making our method stand out against the other blob aggregation
methods.

3 Methods

In mathematics and optimization, a pseudo-Boolean function is a function of the

form .f : Bn → R, where .B = {0, 1} is a Boolean domain and .n is a non-negative
integer called the degree of the function [15].
Goldengorin et al. [16] highlights fundamental reduction and equivalence prop-
erties of penalty-based pseudo-Boolean polynomials which guarantee the mainte-
nance of underlying initial information while reducing the problem complexity.
We utilise this formulation on image patches to achieve compact representations
of patches, and based on the degrees and equivalence [16] properties of the resultant
pseudo-Boolean polynomials, we can qualify an edge/blob classifier on input
patches.
Given an image patch represented by an .m × n sized matrix which we treat as an
information cost matrix .C, our first task is to determine the minimum possible way
of representing this cost in a way that allows us to compare if the patch is extracted
from a blob region or a contour region.
A patch that overlaps regions of contrasting information (i.e. overlapping an
edge) in an image, results in a representation that is costly in comparison to one
that lies over a blob region. We claim this assertion because matrix values in blob
regions are usually equivalent or have very small differences between each other
and since [16]’s pseudo-Boolean polynomial formulation is penalty-based, similar
costs cancel out.
Reducing the patches to their smallest pseudo-Boolean polynomials provides
comparable instances which can be tested for equivalence as well. This property
is termed equivalence in [16] and can be used to fine-tune edges or detect similar
regions on the image matrix.
The Pseudo-Boolean representation according to [6] requires the generation of
a coefficients matrix and its respective terms’ matrix, whose combination creates
monomials of the pseudo-Boolean polynomial.
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 77

After the generation of these matrices, most of the processing: local aggregation,
reduction of columns, and p-truncation processes are heavily dependent on the
terms’ matrix.
Using an example instance to illustrate the formulation process, we take a .4 × 5
patch from an image.
Let
⎡ ⎤
8 8 8 5
⎢12 7 5 7⎥
.C = ⎢ ⎥
⎣18 2 3 1⎦
5 18 9 8

The terms encoding function takes as input the permutations (.) matrix which is
an index ordering of the input matrix.
⎡ ⎤
4 3 3 3
⎢1 2 2 1⎥
. = ⎢ ⎥
⎣2 1 1 2⎦
3 4 4 4

Using ., we sort the initial cost matrix .C

⎡ ⎤
5 2 3 1
⎢8 7 5 5⎥
.sorted C = ⎢ ⎥
⎣12 8 8 7⎦
18 18 9 8

and derive the .ΔC matrix

⎡ ⎤
5 2 31
⎢3 5 2 4⎥
.C = ⎢ ⎥
⎣4 1 3 2⎦
6 10 11

Using the . matrix we calculate the terms’ matrix

⎡ ⎤
⎢ y4 y3 y3 y3 ⎥
y=⎢
.
⎣ y1 y4 y2 y3 y2 y3 y1 y3 ⎦
⎥

y1 y2 y4 y1 y2 y3 y1 y2 y3 y1 y2 y3
78 T. M. Chikake et al.

and derive the resulting pseudo-Boolean polynomial

⎡ ⎤
5 2 3 1
⎢ 3y4 5y3 2y3 4y3 ⎥
.⎢ ⎥
⎣ 4y1 y4 1y2 y3 3y2 y3 2y1 y3 ⎦
6y1 y2 y4 10y1 y2 y3 1y1 y2 y3 1y1 y2 y3

We then perform local aggregation by summing similar terms and get a compact
representation of the initial instance as
⎡ ⎤
0 0 11
⎢ 0 11y3 3y4 ⎥
.⎢ ⎥
⎣2y1 y3 4y2 y3 4y1 y4 ⎦
0 12y1 y2 y3 6y1 y2 y4

which in this particular example has 50% less cost compared to the initial instance
when expressed as polynomial.
We then search for the equivalent matrix with the minimum number of columns
and in this particular example, it is already reduced to this state.
When pseudo-Boolean polynomials are reduced to their smallest instance, an
equivalence property is apparent because different instances of similar information
converge into a similar reduced pseudo-Boolean polynomials.
This property is central to the blob aggregation task, and in this task these regions
of equivalence in the reduced pseudo-Boolean polynomials context, occupy the set
of blobs.
Below are examples of cost matrices, which are initially different but converge
into similar reduced instances.
⎡ ⎤
138 138 138 136
⎢139 139 138 137⎥
.⎢ ⎥
⎣142 141 139 138⎦
142 140 139 138

and
⎡ ⎤
136 136 138 140
⎢138 137 138 140⎥
.⎢ ⎥
⎣140 139 140 141⎦
139 139 140 141

reduce to
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 79

Fig. 1 Pseudo-Boolean polynomial degree plotted for an image of primitive shapes

⎡ ⎤
550
⎢ 3y1 ⎥
.⎢ ⎥
⎣ 6y1 y2 ⎦
1y1 y2 y4

By plotting the degree of the pseudo-Boolean polynomials of each patch on a

surface plot, we observe contours available in spatial features of a given image as
shown in Fig. 1.
For our edge/blob classifier, we use the pseudo-Boolean polynomial degree .r to
classify whether the particular patch lies over a contour or a blob region. We select
.p < m to be the cut-off threshold for classification and based on this value, we can

alter how fine/course should our edges be. If the degree .r of the pseudo-Boolean
polynomial calculated on the patch is higher than .p then the patch lies over an edge
or blob otherwise. The edge/blob classifier if simply

edge if r < p,
.f (r, p) = (1)
blob otherwise
80 T. M. Chikake et al.

Given an image, broken into small patches of size .4 × 4, the maximum possible
degree is .3, and we can select .p = 1, such that we have a binary set of patches,
where patches, whose pseudo-Boolean polynomials reduce to a constant are
described as blob regions, and the rest as edge points.
Using equivalence [6], we can fine-tune the classified edges, should neighbouring
patches exhibit contrasting edge/blob classes by equating the blobs in favour of the
edge or vice-versa depending on how fine we require our edges to be.

4 Experimental Setup

Given an image of size .200 × 200 containing basic primitive shapes of continuous
colour shown in Fig. 2a, we extract patches of significantly smaller sizes, e.g. .6 × 6
as shown in Fig. 2b.
We then apply the formulation and reduction of pseudo-Boolean polynomials on
each patch and group them into a binary set .S = {Blob, Edge} based on the pseudo-
Boolean polynomial degree.
Patches whose pseudo-Boolean polynomial degree .r < p are considered Blobs
and Edge otherwise.
Patches with constant pixel values .x ∈ [0, 255] like
⎡ ⎤
99 99 99 99
⎢99 99 99 99⎥
.⎢ ⎥
⎣99 99 99 99⎦
99 99 99 99

are guaranteed to converge to constant (zero-degree pseudo-Boolean polynomials),

. 396

Fig. 2 Input image of primitives and its patching process. (a) Input image. (b) Showing .6 × 6
sized patches
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 81

while some, with varied pixel values which cancel out like
⎡ ⎤
254 254 19 84
⎢254 254 19 84⎥
.⎢ ⎥
⎣254 254 19 84⎦
254 254 19 84

also converge to constant (zero-degree pseudo-Boolean polynomials)

. 611

Consequently these patches are grouped among the set of patches whose informa-
tion cost is described as blob region.
There are patches which contain varied pixel data which may converge to high-
order pseudo-Boolean polynomials like
⎡ ⎤
254 254 6 17
⎢254 254 6 17 ⎥
.⎢ ⎥
⎣254 254 6 17 ⎦
254 254 6 123

and
⎡ ⎤
254 254 6 17
⎢254 254 6 123⎥
.⎢ ⎥
⎣254 254 6 123⎦
254 254 6 123

which converge to

531
.
106y1 y2 y3

a 3rd-degree pseudo-Boolean polynomial which would be classified as lying on edge

regions, should the cut-off threshold be given as .r = 2. In the particular simple case
of the image in Fig. 2a, patches which lie at shape edges converge to these non-zero
degree pseudo-Boolean polynomials.
By colour-coding the patches which belong to the set of patches which have
pseudo-Boolean polynomial’s degree .r = 0, we can observe regions of colour
continuation, which we can classify as blobs as shown in Fig. 3.
As can be observed from Fig. 3, the method allows us to find border points (white
coloured) on simple objects found in the image.
Applying the same process on a natural and unprocessed image may not
result in desirable or useful aggregation as natural images often have less drastic
82 T. M. Chikake et al.

Fig. 3 Colour coding blue to discovered, zero-degreed pseudo-Boolean polynomials of input

instances

transition of spatial features. For this reason we ported the use of a Gaussian
filter as a preprocessing step, followed by a pixel set aggregation step which is
essentially a multivalued threshold processing. Instead of raw pixels as input in our
patches, we group ranges into sets of pixels whose size depends on the variance of
pixel distribution in the image to promote group convergence of pseudo-Boolean
polynomials.
We create these .f (x) = sets of pixels as
⎧
⎪
⎪0 if x < 5,
⎪
⎪
⎪
⎪
⎪
⎪1 if x ∈ [5, 10),
⎪
⎪
⎪
⎪ if x ∈ [10, 15),
⎨2
.f (x) = ... (2)
⎪
⎪
⎪
⎪
⎪
⎪...
⎪
⎪
⎪
⎪...
⎪
⎪
⎩
51 if x > 250,
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 83

thereby reducing the information cost range from .[0, 255] to .[0, 51] for instance.
Natural images tend to have smooth transitions of pixel values for neighbouring
pixels at atomic level due to anti-aliasing in RGB representation of image data,
thereby reducing the chances of neighbouring pixel patches converging into equiv-
alent groups and consequently coarse edges but the Gaussian filter preprocessing
together with the grouping of pixel ranges promotes finer edges.

5 Results and Discussion

We apply our aggregation process on natural images and observe the need for the
Gaussian filter as well as the pixel set aggregation preprocessing to achieve useful
segmentation. Figure 4 shows input image of a noisy and natural image that we pass
through the segmentation processes without preprocessors as shown in Fig. 5, and
with processors as shown in Fig. 6.
As can be observed in Fig. 6, a Gaussian preprocessing and pixel grouping step
allow us to achieve better edge and blob extraction. A Set of pixels, each of size
40, limit our cost range from .[0, 255] to .[0, 7] and encourage pronunciation of
contrasting regions.
Additionally, we can observe that including a costly operation which aggregates
those reduced pseudo-Boolean polynomials into equivalent groups finds numerous,
insignificantly small and unuseful groups in the setups which exclude the prepro-
cessing steps, while larger equivalent groups can be aggregated in the setups which
involve all the preprocessing steps.

Fig. 4 Input natural image

for preprocessing comparison
84 T. M. Chikake et al.

Fig. 5 Segmentation without the pre-processing steps

5.1 Dubai Landscape Dataset

We apply our method with selected processing parameters on the Dubai landscape
dataset [17]. In this experiment we show that our method can extract edges of land-
scape features on satellite imagery which can be used for semantic segmentation.
Humans in the Loop published an open access dataset annotated for a joint
project with the Mohammed Bin Rashid Space Center in Dubai, UAE, which
consists of aerial imagery of Dubai obtained by MBRSC satellites and annotated
with pixel-wise semantic segmentation in 6 classes [17].
The full solution on this dataset requires placing labels on each segmented region,
however our current method can only segment regions on boundary edges.
The distinctive advantages of our method against the state-of-art neural network
based solutions in instance segmentation are:
• blind segmentation(no learning is involved, which can be prone to overfit-
ting/under fitting issues)
• faster and CPU friendly segmentation
• explainable mathematical steps to segmentation.
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 85

Fig. 6 Segmentation with all the pre-processing steps included

Our method is also not limited to a given dataset, since it works in a blind manner,
which does not require prior familiarity of related image data except for purposes
of choosing the threshold ranges. Provided with images which contain contrasting
features, we can guarantee that our method will propose segmentations of features
in pure mathematical and deterministic steps.
Figure 7 show an example processing of our method on an image sample in the
Dubai Landscape dataset [17].
Our method brings to limelight unbiased and fast computer vision in a segmenta-
tion task. The parameters required to achieve useful segmentation using our method
are limited to:
1. Gaussian filter kernel size,
2. pixel thresholding size and,
3. patch sizes.
The optimal choice of these parameters is the only limitations for the general-
ization of our proposed method, and we propose in our future work, an automized
process for selecting these parameters.
86 T. M. Chikake et al.

Fig. 7 Example segmentation [Dubai landscape]

The performance of the method, based on the choice of the patch size parameter
is linearly dependent: the smaller the patch size, the finer the segmentation but
longer processing, and vice-versa.

6 Conclusion

In this article, we presented our proposed method of formulating pseudo-Boolean

polynomials on image patches which results in unsupervised edge detection, blob
extraction and image segmentation processes. We managed to show that our
proposed method, works in a fast, unbiased and competitively accurate manner in
segmenting contrasting regions in image data. We plan to automate the choosing of
processing parameters so that our method achieves full functionality as a general
purpose image segmentation tool and one of the major tasks in our next challenges
is focused on grouping blob regions based on colour histograms to provide labels
and consequently achieve complete semantic segmentation.

References

1. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, Upper Saddle River, NJ
(2001)
2. Yusuf, M.D., Kusumanto, R., Oktarina, Y., Dewi, T., Risma, P.: Blob analysis for fruit
recognition and detection. Comput. Eng. Appl. J. 7(1), 25–36 (2018)
Pseudo-Boolean Polynomials Approach to Edge Detection and Image Segmentation 87

3. Guo, D., Pei, Y., Zheng, K., Yu, H., Lu, Y., Wang, S.: Degraded image semantic segmentation
with dense-gram networks. IEEE Trans. Image Process. 29, 782–795 (2020)
4. Yi, J., Wu, P., Jiang, M., Huang, Q., Hoeppner, D.J., Metaxas, D.N.: Attentive neural cell
instance segmentation. Med. Image Anal. 55, 228–240 (2019)
5. Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation arXiv.org
(2018). https://arxiv.org/abs/1801.00868v3
6. AlBdaiwi, B.F., Ghosh, D., Goldengorin, B.: Data aggregation for p-median problems. J.
Comb. Optim. 21(3), 348–363 (2011)
7. Chennupati, S., Narayanan, V., Sistu, G., Yogamani, S., Rawashdeh, S.A.: Learn-
ing panoptic segmentation from instance contours (2021). http://arxiv.org/abs/2010.11681.
ArXiv:2010.11681 [cs]
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2018). https://doi.org/10.48550/
arXiv.1703.06870. http://arxiv.org/abs/1703.06870. ArXiv:1703.06870 [cs]
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image
segmentation. Tech. Rep. arXiv:1505.04597 (2015)
10. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell.
PAMI-8(6), 679–698 (1986)
11. Liu, K., Xiao, K., Xiong, H.: An image edge detection algorithm based on improved canny. In:
Proceedings of the 2017 5th International Conference on Machinery, Materials and Computing
Technology (ICMMCT 2017). Atlantis Press, Beijing, China (2017)
12. Kong, H., Akakin, H.C., Sarma, S.E.: A generalized Laplacian of Gaussian filter for blob
detection and its applications. IEEE Trans. Cybern. 43(6), 1719–1733 (2013)
13. Assirati, L., Silva, N.R., Berton, L., Lopes, A.A., Bruno, O.M.: Performing edge detection by
difference of Gaussians using q-Gaussian kernels. J. Phys. Conf. Ser. 490, 012020 (2014)
14. Li, S., Liu, C., Wang, Y. (eds.): Pattern Recognition: 6th Chinese Conference, CCPR 2014,
Changsha, China, November 17–19, 2014. Proceedings, Part I, Communications in Computer
and Information Science, vol. 483. Springer Berlin Heidelberg, Berlin (2014)
15. Boros, E., Hammer, P.L.: Pseudo-boolean optimization. Discrete Appl. Math. 123(1–3), 155–
225 (2002)
16. Goldengorin, B., Krushinsky, D., Pardalos, P.M.: Cell Formation in Industrial Engineering,
Springer Optimization and Its Applications, vol. 79. Springer New York, New York (2013)
17. Loop, H.i.t.: Semantic segmentation dataset | humans in the loop (2020). https://
humansintheloop.org/resources/datasets/semantic-segmentation-dataset-2/. Accessed: 2022-
12-03
Purifying Data by Machine Learning
with Certainty Levels

Shlomi Dolev and Guy Leshem

1 Introduction

Motivation A fundamental paradigm used for autonomic computing, self-

managing systems, and decision-making under uncertainty and faults is machine
learning. Classification of machine learning algorithms that are designed to deal
with Byzantine (or malicious) data are of great interest since a realistic model of
learning from examples should address the issue of Byzantine data. Previous work,
as described below, tried to cope with this issue by developing new algorithms
using a boosting algorithm (e.g., “AdaBoost”, “Logitboost” etc.) or other robust and
efficient learning algorithms e.g., [13]. These efficient learning algorithms tolerate
relatively high rates of corrupted data. In this paper we try to handle the issue
using a different approach, that of introducing the certainty level measure as a tool
for coping with corrupted data items, and of combining learning results in a new
and unique way. We present two new approaches to increase the certainty levels
of machine learning results by calculating a certainty level that takes into account
the corrupted data items in the training data-set file. The first scheme is based
on identifying statistical parameters when the distribution is known (e.g., normal
distribution) and using an assumed bound on the number of corrupted data items
to bound the uncertainty in the classification. The second scheme uses decision
trees, similar to the random forest techniques, incorporating the certainty level
to the leaves. The use of the certainty level measure in the leaves yields a better

S. Dolev
Ben-Gurion University of the Negev, Be’er Sheva City, Israel
e-mail: [email protected]
G. Leshem ()
Ashkelon Academic College, Ashkelon, Israel
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 89

collaborative classification when results from several trees are combined to a final
classification.
Previous Work In the Probably Approximately Correct (PAC) learning frame-
work, Valiant [14] introduced the notion of PAC learning in the presence of
malicious noise. This is a worst-case model of errors in which some fraction
of the labeled examples given to a learning algorithm may be corrupted by an
adversary who can modify both example points and labels in an arbitrary fashion.
The frequency of such corrupted examples is known as the malicious noise rate.
This study assumed that there is a fixed probability β (0 < β < 1) of an error
occurring independently on each request, but the error is of an arbitrary nature. In
particular, the error may be chosen by an adversary with unbounded computational
resources and knowledge of the function being learned, the probability distribution
and the internal state of the learning algorithm (note that in the standard PAC model
the learner has access to an oracle returning some labeled instance (x,C(x))for each
query, where C(x) is some fixed concept belonging to a given target class C and x
is a randomly chosen sample drawn from a fixed distribution D over the domain X.
Both C and D are unknown to the learner and each randomly drawn x is independent
of the outcomes of the other draws.
In the malicious variant of the PAC model introduced by Kearns and Li [8], the
oracle is allowed to ‘flip a coin’ for each query with a fixed bias η for heads. If
the outcome is heads, the oracle returns some labeled instance (x,) antagonistically
chosen from X × {−1,+1}. If the outcome is tails, the oracle is forced to behave
exactly like in the standard model returning the correctly labeled instance (x,C(x))
where x ∼ D (x is a drawn sample from the distribution D).
In both the standard and malicious PAC models the learner’s goal for all inputs
ε, > 0 is to output some hypothesis H ∈ H (where H is the learner’s fixed
hypothesis class) by querying an oracle at most m times for some m = m(ε, )
in the standard model, and for some m = m(ε, , η) in the malicious model. For
all targets C ∈ C and distributions D, the hypothesis H of the learner must satisfy
Ex∼D [H (x) = C(x)] ≤ ε with a probability of at least 1 − with respect to
the oracle’s randomization. We will call ε and the accuracy and the confidence
parameter, respectively. Kearns and Li [8] have also shown that for many classes
of Boolean functions (concept classes), it is impossible to accurately learn ε if the
ε
malicious noise rate exceeds 1+ε . In fact, for many interesting concept classes, such
as the class of linear threshold functions, the most efficient algorithms known can
only tolerate malicious noise rates significantly lower than this general upper bound.
Despite these difficulties, the importance of being able to cope with noisy data
has led many researchers to study PAC learning in the presence of malicious noise
[1–3, 5, 6, 9, 13]. In Servedio [13], a PAC boosting algorithm is developed using
smooth distributions. This algorithm can tolerate low malicious noise rates but
requires access to a noise-tolerant weak learning algorithm of known accuracy. This
weak learner, L, which takes as input a finite sample S of m labeled examples,
has some tolerance to malicious noise; specifically, L is guaranteed to generate
a hypothesis with non-negligible advantage provided that the frequency of noisy
Purifying Data by Machine Learning with Certainty Levels 91

examples in its sample is at most 10% and that it has a high probability to learn with
high accuracy in the presence of malicious noise at a rate of 1%.
Our Contribution We present a verifiable way to cope with arbitrary faults
introduced by even the most sophisticated adversary, and show that the technique
withstands this malicious (called Byzantine) intervention so that even in the worst
case scenario the desired results of the machine learning algorithm can be achieved.
The assumption is that an unknown part of a data-set is Byzantine, namely,
introduced to mislead the machine learning algorithm as much as possible. Our goal
is to show that we can ignore/filter the influence of the misleading portions of the
malicious data-set and obtain meaningful (machine learning) results. In reality, the
Byzantine portion in the data-set may be introduced by a malfunctioning device with
no adversarial agenda, nevertheless, a technique proven to cope with the Byzantine
data items will also cope with less severe cases. In this paper, we develop three new
approaches for increasing the certainty level of the learning process, where the first
two approaches identify and/or filter data items that are suspected to be Byzantine
data items in the data-set (e.g., a training file). In the third approach we introduce
the use of the certainty level for combining machine learning techniques (similar to
the previous studies).
The first approach fits best the case in which the Byzantine data is added to the
data-set, and is based on the calculation of the statistical parameters of the data-set.
The second approach considers the case where part of the data is Byzantine, and
extends the use of the certainty level for those cases in which no concentrations
of outliers are identified. Data-sets often have several features (or attributes) which
are actually columns in the training and test files that are used for cross-checks
and better prediction of the outcome in both simple and sophisticated scenarios.
The third approach deals with cases in which the Byzantine data is part of the data
and appear in two possible modes: where part of the data in a feature is Byzantine
and/or where several features are entirely Byzantine. The third technique is based
on decision trees similar to the Random Forest algorithm [4]. After the decision
trees are created from the training data, each variable from the training data passes
through these decision trees, and whenever the variable arrives to a tree leaf, its tree
classification is compared with its class. When the classification and the class are
in agreement, a right variable of the leaf is incremented; otherwise, the value of a
wrong variable of this leaf is incremented. The final classification for every variable
will be determined according to the right and wrong values. This enhancement of the
random forest is of an independent interest conceptually and practically, improving
the well known random forest technique.
Road Map The rest of the paper is organized as follows: In the next section
(Sect. 2), we describe approaches for those cases in which Byzantine data items
are added to the data-set, and the ways to identify statistical parameters when the
distribution of a feature is known. In Sects. 3 and 4, we present those cases in
which the Byzantine adversary receives the data-set and chooses which items to
add/corrupt. Section 3 describes ways to cope with Byzantine data in the case of a
single feature with a classification of a given certainty level. Section 4 extends the
92 S. Dolev and G. Leshem

use of the certainty level to handle several features, extending and improving the
random forest techniques. The conclusion appears in Sect. 5.

2 Addition of Byzantine Data

We start with the cases in which Byzantine data is added to the data-set. Our goal
is to calculate the statistical parameters of the data-set, such as the distribution
parameters of the uncorrupted items in the data-set, despite the addition of the
Byzantine data. Consider the next examples that derive the learning algorithm to
the wrong classification, where the raw data contains one feature (or attribute) of
the samples (1 vector) that obeys some distribution (e.g., normal distribution), plus
additional adversary data. The histogram that describes such an addition is presented
on the left side of Fig. 1, where the “clean” samples are inside the curve and the
addition of corrupted data is outside the curve (marked in blue). The corrupted
data items in these examples are defined as samples that cause miscalculation of
statistical parameters like .μ and .σ and as a result, the statistical variables are less
significant. Another case of misleading data added to the data-set, a special case to
the one above, is demonstrated on the right side of Fig. 1. The histogram of these
samples is marked in green, where the black vertical line that crosses the histogram
separates samples with labels .+1 and .−1. The labels of the misleading data are
inverted with relation to the labels of other data items with the same value. To
achieve our goal to calculate the most accurate statistical parameters for the feature’s
distribution in the sample population, we describe a general method to identify and
filter the histograms that may include a significant number of additional corrupted
data items.

Fig. 1 Histogram of original samples with additional corrupted data outside the normal curve but
in the bound of .μ ± 3σ (left), and outside the normal curve and outside the bound of .μ ± 3σ (right)
Purifying Data by Machine Learning with Certainty Levels 93

Method for Identifying Suspicious Data and Reducing the Influence of Byzan-
tine Data This first approach is based on the assumption that we can separate
“clean” data by a procedure based on the calculation of the .μ and .σ parameters
of the uncorrupted data. According to the central limited theorem, 30 data items
chosen uniformly, which we call a batch, can be used to define the .μ and .σ . Thus,
the first step is to try to find at least 30 clean samples (with no Byzantine data).
Note that according to the central limit theorem, the larger the set of samples,
the closer the distribution is to being normal, therefore, one may choose to select
more than 30 samples. We use n=30 as a cutoff point and assume that the sampling
distribution is approximately normal. In the presence of Byzantine data one should
try to ensure that the set of 30 samples will not include any Byzantine items.
This case is similar to the case of a shipment of N objects (real data) in which
m are defective (Byzantine). In probability theory and statistics, hypergeometric
distribution describes the probability that in a sample of n distinctive objects drawn
from the shipment, exactly k objects are defective. The probability for selecting k
items that are not Byzantine is:
mN −m
k n−k
P (X = k) =
. N (1)
n

Note that for clean samples k=0 and the equation will be
N −m
P (X = 0) = Nn
. (2)
n

In order to prevent the influence of the adversary on the estimation of .μ and .σ (by
addition of Byzantine data), we require that the probability in equation 2 will be
higher than 50.% (.P > 12 ). Additionally, according to the Chernoff bound we will
obtain a lower bound for the success probability of the majority of n independent
choices of 30-sample batches (thus, by a small number of batch samplings we will
obtain a good estimation for the .μ and .σ parameters of clean batches). The ratio
between N (all samples) to m (Byzantine samples) that implies a probability to
sample a clean batch that is greater than . 12 is presented in Fig. 2.
As demonstrated in Fig. 2 the ratio between N (all samples), and m (Byzantine
samples) is about 2.% (e.g., 20 Byzantine samples for every 1000 samples, so if this
Byzantine ratio is found by the new method (as described below) the probability
that any other column in the data-set will contain a Byzantine sample is very low (in
other words, the confidence that in every other column the samples are “clean” is
high)). Our goal is to sample a majority of “clean” batches to estimate statistical
parameters such as .μ and .σ of the non-Byzantine samples in the data-set. The
estimation of these parameters will be done according to an Algorithm 1.
94 S. Dolev and G. Leshem

The procedure below:

Algorithm 1 Estimate statistical parameters

1. For 1 to the chosen B do (B will be selected according to the Chernoff bound. ),
2. Randomly and uniformly choose a batch of size n (e.g., .n = 30) from the population of
interest(e.g., one feature),
3. Compute the desired batch statistic .μ and .σ μ = n1 i=1 n x , and, σ =
i n−1 i=1 (xi − x) ,
1 n 2

4. end for
5. On the assumption that the distribution of the original data is normal or approximately normal,
the histogram of the estimated .μ and .σ is also approximately normal (according to the central
limit theorem). The probability to choose a “clean” batch is higher than 50.%, therefore, at least
50.% or more of the estimations are clean. The value of .μ̂ (and .σ̂ ) will be chose to be the median
of the .μ (and .σ ), thus ensuring that our choice has at least one clean batch with higher (and one
with lower) .μ (and .σ ,respectively).
. The Chernoff bound gives a lower bound for the success probability of majority agreement
for b independent, equally likely events, and the number of trials is determined according to the
following equation: .B ≥ 2(P −1/2)1
2 ln
√1 , where the probability .P > 1 and . is the smallest
2
probability that we can promise for an incorrect event (e.g., for the probability of a correct event
at a confidence level of 95.% or 99.%, the probability for an incorrect event, . , is 0.05 or 0.01,
respectively).

Algorithm 1: Description of a method for estimating statistical parameters like .μ

and .σ .
Using Expected Value and Variance to Predict Distribution Shape Up to this
stage, we used the central limit Theorem (CLT), stating that: the average samples of
observations uniformly drawn from some population with any distribution shape is
approximately distributed as a normal distribution, resulting in the expected value

Fig. 2 Ratio between N (all samples) to m (Byzantine samples) for .P ≥ 1

2
Purifying Data by Machine Learning with Certainty Levels 95

and the variance. Based on CLT, we were able to efficiently obtain (using Chernoff
bound) the expected value and the variance of the data item values. Next, for every
given number of data items, and type of distribution graph, the parameters of the
graph that will respect these values (expected value, variance, distribution type,
and number of data items) can be found. In the sequel, we consider the case of
a distribution type of graph which reflects the normal distribution. The next stage
for identifying suspicious data items is based on analysis of the overflow of data
items beyond the distribution curve (Fig. 1). The statistical parameters which were
found in the previous stage are used in the procedure described in Algorithm 2, the
procedure below:

Algorithm 2 Technique for removing suspected data

1. Take the original sample population of interest (e.g., one feature from the data set) and create a
histogram of that data,
2. Divide the histogram into . bins within the range .μ ± 3σ (e.g., . = 94), and count the actual
number of data items in every bin,
3. Compute the number of data items in every bin by using the integral of the normal curve
according to .μ and .σ (which were found by the previous method (Algorithm 1)) multiplied
by the number of “clean” samples. , and compare with the actual number,
4. If the ratio between the counted number of data items in a bin and the computed number
according to the integral is higher than 1+.ξ (e.g., .ξ =0.5), the data items in this column are
suspect,
5. The samples from the suspicious bins will be marked and will not be considered by the machine
learning algorithm.
. N, the number of the “clean” samples can be define to be 98.% of the total number f data items,
assuming the data-set contains at most 2.% Byzantine items. Alternately one may estimate the
number of the “clean” samples using the calculated .μ and .σ . We assume that batches .μ (and
.σ ) very near to the selected .μ (and .σ ) represent “clean” population. Thus, the bins from the
histogram with these values are probably clean. The ratio between the original number of data
items in this clean bin to the integral of the normal curve for this bin can be used as an estimation
for N.

Algorithm 2: Description of the first technique for removing suspected data items.
The suspicious bins, those with a significant overflow, are marked and will not be
considered for the training process of the machine learning. The data-set after the
cleaning process contains values from bins (in the data histogram) without overflow
(e.g., the ratio between the integral of the normal curve to the data items in the same
bin is approximately 1).
Note that when the number of extra data items in the bins (which was counted
during the “cleaning” process) with overflow (data items outside the integral curve)
is higher than 2.% of the whole data-set, we can assume that the other bins are clean.
The next section deals with the remaining uncertainty.
Experiments and Results We present results to demonstrate the qualities of our
new approach, and to reveal the differences between a data-set with and without
corrupted data in order to validate the proposed method experimentally. The
96 S. Dolev and G. Leshem

Table 1 Results of C4.5 Test No. Dataset Accuracy

algorithm, on original and
corrupted data 1 Artificial data 100.%
2 Artificial data 78.%
with corrupted data
3 Artificial data 96.%
after removing
suspicious data

comparison between the data-sets was done with the C4.5 algorithm [12]. C4.5
uses a decision tree developed by Quinlan [11], and is an extension of Quinlan’s
earlier ID3 (Iterative Dichotomiser 3) algorithm [10]. The decision trees generated
by C4.5 can be used for classification, and for this reason C4.5 is often referred to
as a statistical classifier. For this comparison we used an available tool based on the
C4.5 file from “Classification Toolbox for Matlab”. The results appear in Table 1.
The data-set for the first experiment is an artificial data-set created by Matlab
software with the function “normrnd” (which generates random numbers from a
normal distribution with a mean parameter .μ and a standard deviation parameter
.σ )—this database consist of uniformly distributed data and contains 1 attribute

(2296 samples) and 2 classes (.−1,.+1). Note that in the presence of Byzantine
data items the resulting data-set must be a worse case with relation to the original
data items and the learning algorithm. Here we deliberately introduced a corrupt
data-set that includes Byzantine data to demonstrate the benefits of the proposed
technique. For the second and third experiments, the previous database is extended
with corrupt data (such as the data-set shown in Fig. 1) that also has an inverted
label in relation to the label with the same value (2296 “clean” plus 45 “corrupted”,
2341 samples in total).

3 Corruption of Existing Data, Single Feature Learning with

a Certainty Level

We continue considering the case where part of the data in the feature is corrupted.
Our goal in this section is to find the certainty level of every sample in the
distribution in the case where the upper bound on a number of corrupted data items
is known. This section is actually a continuation of the previous, as both sections
deal with a single feature, where the first deals with an attempt to find overflow of
samples and the second, cope with unsuccessful such attempts; either due to the
fact that the distribution is not known in advance, or that no overflows are found.
The histogram of these samples is colored green, where the black vertical line that
crosses the histogram separates samples with labels .+1 and .−1. The labels of the
Byzantine data have an inverted label with relation to the label of the non-Byzantine
Purifying Data by Machine Learning with Certainty Levels 97

data items with the same value. To achieve our goal we describe a general method
that bounds the influence of the Byzantine data items.
Method to Bound the Influence of the Byzantine Data Items The new approach
is based on the assumption that an upper .ξ on the number of Byzantine data items
that may exist in every bin in the distribution is known (e.g., maximum .ξ equals 8
items). The certainty level .ζ of each bin is calculated by the following equations:

L−1 − ξ
ζ−1 =
. (3)
N
L+1 − ξ
ζ+1 =
. (4)
N
Where .L−1 is the number of data items that are labeled as .−1, .L+1 is the number
of data items that are labeled as .+1, and N is the number of data items in the bin.

Algorithm 3 Finding the certainty level

1. Take the original sample of size n from the population of interest (e.g., one feature from the
data set),
2. Sort the n data items (samples) according to their value and create their histogram,
3. Count data items at every bin, where the size of bin is the value of natural number in the
histogram .± 0.5 (e.g., for the natural number 73, the bin is between 72.5 to 73.5) and count the
number of data items that are labeled as .−1 and .+1.
4. Find the certainty level .ζ of each bin according to equations 3 and 4, and the assumption of the
size of the maximum .ξ .

Algorithm 3: Description of the method for finding the certainty level of every
sample for .ξ Byzantine data items in every bin in the distribution.

4 Corruption of Existing Data, Multi-Feature Learning (with

a New Decision Trees Algorithm)

Our last contribution deals with the general cases in which corrupted data are part of
the data-set and can appear in two modes: (i) An entire feature is corrupted (Fig. 3),
and (ii) Part of the features in the data-set is corrupted and the other part is clean.
Note that there are several ways to corrupt an entire feature, including: (1) inverting
the classification of data items, (2) selection of random data items, and (3) producing
classifications inconsistent with the classifications of other non-corrupted features.
Our goal, once again, is to identify and to filter data items that are suspected to
be corrupted. The first case (i) is demonstrated by Fig. 3, where the raw data items
contain one feature and one vector of labels, where part of the features are totally
98 S. Dolev and G. Leshem

Fig. 3 Histogram of original samples with corrupted data inside the normal curve

non-corrupted and part are suspected to be corrupted (for all samples in this column
there is a wrong classification).
Method to Bound the Influence of the Corrupted Data Items Our technique is
based on the Random Forest; like the Random Forest algorithm [4] we use decision
trees, where each decision tree that is created depends on the value of a random
vector that represents a set of random columns chosen from the training data.
Large numbers of trees are generated to create a Random Forest. After this forest
is created, each instance from the training data set passes through these decision
trees. Whenever a data set instances arrives to a tree leaf, its tree classification is
compared with its class (.+1 or .−1); when the classification and the class agree the
right instance of the leaf is incremented; otherwise the value of the wrong instance
of this leaf is incremented, e.g., 351 instances were classified by Node 5 (leaf): 348
with the right classification and 3 with the wrong classification (Fig. 4).
Certainty Adjustment Due to Byzantine Data Bound The certainty level .ζ of
each leaf can be calculated based on the assumption that the upper bound on
the number of corrupted data items .ξ at every leaf in the tree is known. These
calculations are arrived at using equations 3 and 4, where, .L−1 is the number of
Purifying Data by Machine Learning with Certainty Levels 99

Fig. 4 Example of a decision tree for predicting the response for the instances in every leaf with
right or wrong classification

variables (in the leaf) that are labeled as .−1, .L+1 is the number of instances (in
the leaf) that are labeled as .+1, and N is the total number of variables that were
classified by the leaf.
In the second step, each instance from the test data set passes through these decision
trees to get its classification. Each new tested instance will get a classification result
and a confidence level, where the confidence level is in the terms of the (training)
right and wrong numbers associated with the leaf in the tree. The final classification
is a function of the vector of tuples .classif ication; right; wrong; with refer-
ence to a certainty level rather than a function of the vector of .classif ication
which is used in the original Random Forest technique. In this study we show one
possibility for using the vector of .classif ication; right; wrong; , though other
functions can be used as well to improve the final classification.
Algorithm 4: Description of the method for identifying and filtering Byzantine
data for multi-feature data-sets.
We tune down the certainty in each leaf using a given bound on the cor-
rupted/Byzantine data items. The contribution of this part includes a conceptual
improvement of the well known random forest technique; by re-examining all data
items in the data set. The re-examination counts the number of right and wrong
classifications in each leaf of the tree.
100 S. Dolev and G. Leshem

Algorithm 4 Identify and filter Byzantine data

1. First, select the number of trees to be generated, e.g. K,
2. For k=1 to K do
3. A vector .θk is generated, where .θk represents the data samples selected for creating the tree (e.g.,
random columns chosen from training data sets - these columns are usually selected iteratively
from the set of columns, with replacement between iterations),
4. Construct tree T (.θk ,y) by using the decision tree algorithm,
5. End for
6. Each instance from the training data passes through these decision trees, and for every leaf the
number of instances that are classified correctly (right) and incorrectly (wrong) are counted,
then the percentages of right and wrong classifications are calculated,
7. Each instance from the test data set passes through these decision trees and receives a
classification,
8. Each new instance will receive a result .classif ication; right; wrong; from trees in the
forest, right and wrong percentages from all the trees are summarized (e.g., sample 10 is
classified by Tree No. 1 at Node 5 as .+1 with 90.% (or 0.9) correctness and 10.% (or 0.1)
incorrectness, by Tree No. 2 at Node 12 as .+1 with 94.% (or 0.94) correctness and 6.% (or
0.06) incorrectness, where the total correctness of .+1 for this sample from both trees is 92.%
(or 0.92) and 8.% (or 0.08) for .−1). The final classification for each instance will be determined
according to the difference between the total correctness (right classifications) for .+1 to the
total incorrectness (wrong classifications) for .+1 that are summarized from all trees. .
. This is one option for using the right and wrong counters to determine the classification.

Experiments and Results In order to validate the proposed method experimen-

tally, we present results to demonstrate the qualities of our new approach and
reveal the differences when a data-set with and without corrupted data is processed.
The comparison between the data-set was done with the Multi-Feature algorithm
of Algorithm 4. Using the Matlab function T=treefit(X,y) the algorithm creates a
decision tree T for predicting response y as a function of predictor X. X is an n-by-
m matrix of predictor values. y is a vector of n response values (for classification).
Another experiment was run to determine the number of wrong samples (at every
leaf) the classification trees can handle. Thus, from the correctness and incorrectness
values of every leaf (which were found in step 5 in the method above) certain
numbers of samples will be subtracted (for correctness) or added (for incorrectness)
and vice versa (e.g., leaf 5 at tree in Fig. 4 for class .−1 was found with 348 samples
with correct classification and 3 samples with incorrect classification. If 8 Byzantine
samples are present in the leaf, the classification of the data will change to 340 with
correct classification and 11 with incorrect classification. The results are presented
in Table 2.
For the first experiment (Tests 1 and 2), we used the “Satimage” data-set which
is a well known data-set for classification of a satellite image. The original data for
this database was generated from data purchased from NASA by the Australian
Center for Remote Sensing, and contains 36 attributes and 6 classes (2296 for
training and 2000 for testing). For the second experiment (Tests 3 and 4), we
used an artificial data-set which was created by Matlab software with the function
“normrnd” (generates random numbers from the normal distribution with a mean
Purifying Data by Machine Learning with Certainty Levels 101

Table 2 Results of the Multi-Feature algorithm, on original and corrupted data

Accuracy Accuracy right
Test No. Data set majority vote and wrong
1 Setimage 88.1.% 92.25.%
2 Artificial data base on Setimage 88.7.% 92.4.%
3 Artificial data with corrupted data case 1 Low (less than 87.8.%
65.%)
4 Artificial data with corrupted data case 2 85.3.% 92.4.%
5 Setimage with 12 corrupted samples at every leaf 88.1.% 89.20.%

parameter .μ and a standard deviation parameter .σ )—this database is constructed by

uniformly chosen data items, and contains 36 features (2296 samples) and 2 classes
(.−1,.+1). For the third experiment (Tests 5 and 6), we used artificial data where
some features (the entire columns) are corrupted and the other features are not (case
1). For the fourth experiment (Tests 7 and 8), we used artificial data where part of the
data in the feature (column) are corrupted and some features (columns) are corrupted
(case 2). For Tests 9 and 10 we used the original “Satimage” data-set. To summarize,
we demonstrated that algorithm 4 significantly improve the classification process
with and without Byzantine data.

5 Conclusion and Future Work

In this work we present the development (the details of the experiment results appear
in [7] of three methods for dealing with corrupted data in different cases: The first
method considers Byzantine data items that were added to a given non-corrupted
data set. Batches of uniformly selected data items and Chernoff bound are used to
reveal the distribution parameters of the original data set. The adversary, knowing
our machine learning procedure, can choose, in the most malicious way on, up to
the 2.%. malicious data; Note, that there is no requirement for the additional noise
to come from distribution different than the data items distribution. We prove that
the use of uniformly chosen batches and the use of Chernoff bound reveals the
parameters of the non-Byzantine data items. We propose to use certainty level that
takes into account the bounded number of Byzantine data items that may influence
the classification. The third method is designed for the case of several features, some
of which are partly or entirely corrupted. We present an enhanced random forest
technique based on certainty level at the leaves. The enhanced random forest copes
well with corrupted data. We implemented a system and show that ours performs
significantly better than the original random forest both with and without corrupted
data sets; we are certain that it will be used in practice.
In the scope of distributed systems, such as sensor networks, the methods can
withstand malicious data received from a small portion of the sensors, and still
achieve meaningful and useful machine learning results.
102 S. Dolev and G. Leshem

References

1. Aslam, J., Decatur, S.: Specification and simulation of statistical query algorithms for efficiency
and noise tolerance. J. Comput. Syst. Sci. 56, 191–2087 (1998)
2. Auer, P.: Learning nested differences in the presence of malicious noise. Theor. Comput. Sci.
185(1), 159–175 (1997)
3. Auer, P., Cesa-Bianchi, N.: On-line learning with malicious noise and the closure algorithm.
Ann. Math. Artif. Intel. 23, 83–99 (1998)
4. Breiman, L.: Random forests, Statistics department. Technical report, University of California,
Berkeley (1999)
5. Cesa-Bianchi, N., Dichterman, E., Fischer, P., Shamir, E., Simon, U.H.: Ample-efficient
strategies for learning in the presence of noise. ACM 46(5), 684–719 (1999)
6. Decatur, S.: Statistical queries and faulty PAC oracles. In: Proc. Sixth Work. on Comp.
Learning Theory, pp. 262–268 (1993)
7. Dolev, S., Leshem, G., Yagel, R.: Purifying data by machine learning with certainty levels.
Technical Report August 2009, Dept. of Computer Science, Ben-Gurion University of the
Negev (TR-09-06)
8. Kearns, M., Li, M.: Learning in the presence of malicious errors. SIAM J. Comput. 22(4),
807–837 (1993)
9. Mansour, Y., Parnas, M.: Learning conjunctions with noise under product distributions. Inf.
Proc. Lett. 68(4), 189–196 (1998)
10. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
11. Quinlan, J.R.: Induction of decision trees. In: Machine Learning (1986)
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Burling-
ton (1993)
13. Servedio, A.R.: Smooth boosting and learning with malicious noise. J. Mach. Learn. Res. (4),
633–648 (2003)
14. Valiant, G.L.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
On Impact of Data Models
on Predictability Assessment of Time
Series

Sergey Frenkel

1 Introduction

Significant advances in the development of forecasting systems achieved in the

framework of Machine Learning (ML), primarily based on neural networks, do not
eliminate the difficulties of their use in very critical cases, for example, as it turned
out, when predicting the development of the COVID19 pandemic. Among the many
reasons, the main ones of which are, of course, related to the lack of knowledge
of the relevant subject areas, one can also point out that modern ML, in particular,
those that using mathematical models based on statistical data, often do not take into
account some mathematical properties of random data, which are used in specific
MO systems with prediction models.
In the last decade, work on learning theory for sequence recognition (or online
learning) has tried to avoid as far as possible the need to use assumptions about
the exact types of probability distributions of the data. However presently most
modern forecasting tools, primarily based on neural networks and Deep Learning,
still use parametric statistical models, often based on Gaussian distributions for
a priori estimates of the expected quality of a prediction. Assumptions about the
Gaussian distribution are also essential for the SMV (Support Vector Machine
[1]) algorithm [2]. It is shown that the accuracy of linear models can increase if
the feature is distributed similarly to the normal law. Deep neural networks study
predictive relationships using a number of non-linear layers to build intermediate
representations of features, encoding previously obtained information into a latent
variable zt , while the final forecast is created only using zt , without taking into
account the possible characteristics of the data that exist within certain probabilistic

S. Frenkel ()
Federal Research Center “Computer Science and Control” R£S, Moscow, Russia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 103
B. Goldengorin, S. Kuznetsov (eds.), Data Analysis and Optimization, Springer
Optimization and Its Applications 202,
https://doi.org/10.1007/978-3-031-31654-8_7
104 S. Frenkel

models, and the main hopes for getting a suitable result are placed on computing
power.
In a more general sense, in many modern tools based on neural networks (NN),
the formation of a forecast of data values at points in time in the future is performed
as a search for internal patterns and relationships without using any classical
mathematical procedures (Markov models or other probability theory models) or
underlying formal theories. Training is carried out on examples, and not on a
mathematical model, although using various mathematical tools, such as linear or
logistic regression, at intermediate stages – for example, algorithms based on search
trees and boosting (XGB, etc.) [3]. These circumstances are often supplemented by
the opacity of the transformation of the input data (i.e., the construction of features,
for example, in deep learning algorithms [4]), which makes it difficult to a priori
assess the impact of certain input data properties on the prediction efficacy. As the
analysis shows, this can lead to incorrect forecasts, to the inefficiency of the tools
and procedures of the forecasting subsystems as part of information and computing
systems.
Therefore, it is important to make a preliminary assessment of how efficient a
particular forecasting tool is from the point of view of the designer of a specific
software forecasting tool (“Prediction instrument” – PI). Hereinafter, PI refers to
software implementations of prediction algorithms.
Most research on network traffic forecasting has focused on classical statistical
methods that rely heavily on the use of past (“historical”) data, and the time spent
on obtaining a forecast depends largely on the computational complexity of the
algorithm relative to the volume of this history.
In particular, in recent years there has been a significant increase in the number
of studies using computational approaches, including machine learning methods,
to predict various activities in complex computing systems – from user response
on websites for various purposes, to traffic through it and/or workload prediction.
However, the existing literature mainly focuses on the analysis of algorithms
implemented in modern software products for solving specific problems, and there is
no comprehensive review to answer many important questions, such as the impact of
the different types of data models used and the specifics of the mathematical models
underlying the basis of the algorithms used [3].
At the same time, most models assume that the more observations are used,
the more accurate (better) the forecast. Therefore, the problem of high accuracy
is associated with the problem of computational costs, and hence the speed. In
cases where prediction time is critical, for example, for real-time network traffic
prediction problem, the task of prediction is to minimize computational costs as
much as possible while maintaining an acceptable level of accuracy (efficiency) of
the predictor.
One way to solve this problem is to take into account and use the properties
of mathematical time series models that model traffic and job flows, which can
significantly affect the prediction. One way to ensure that the abundance of modern
approaches and models is taken into account is to develop a general conceptual
model of the prediction problem, which is actually absent in the modern literature.
On Impact of Data Models on Predictability Assessment of Time Series 105

This paper offers an overview of the indicated mathematical models for predict-
ing the behavior of complex systems, primarily telecommunication systems.
Our main goal is to analyze the requirements for mathematical models of data
and systems with processes occurring in them, whose behavior is required to be
predicted using certain software tools, taking into account the availability of data on
the functioning and typical ways of user actions with prediction tools.
We propose a certain categorization of modern algorithms and forecasting
methods based on probabilistic models, primarily focusing on current progress in
machine learning methods.
The probabilistic data models used in modern approaches for various prediction
problems are considered and analyzed, and the conditions and requirements for
prediction models/algorithms are formulated in the framework of probabilistic
models for prediction, from the point of view of harmonization (coordination)
both (data and prediction algorithms) models. Various formal-logical (semantic,
ontological methods) [5] remain outside our field of vision.
A conceptual model of the process of selecting predictors for a specific data set
is proposed and substantiated, on the basis of which a system of recommendations
and a selection scheme can be built.
Various examples of the influence of the characteristics of random sequences
and processes on the accuracy of prediction are considered and analyzed, including
problems of predicting the sign of increments of random series and processes.
We consider from a unified standpoint various prediction models that are
traditionally considered within the framework of various sections of theoretical
computer science [6], probabilistic disciplines [6, 22], and Machine learning [3, 4].
Since there have been many reviews recently on the practical use of various
professional prediction software [3, 4, 9], this paper does not provide a detailed
overview, but only links to sources detailing the relevant software tools.

2 Description and Statement of the Prediction Problem

Although mathematical (probabilistic) models for predicting random sequences,

series and processes have a history of more than eighty years [7], intensive
research in this areas continue so far. This is due to the huge variety of tasks
that are faced when processing large amounts of data. For these tasks, different
criteria for prediction accuracy, different data models, different requirements for
the time (efficiency) of obtaining a forecast may be required. For example, for
many problems of predicting the values of financial indices, the criterion for the
minimum standard deviation of the predicted data from the true values (on which
the Kolmogorov-Wiener criterion and their numerous linear modifications is based
[7]) may well be suitable. But it is obviously not suitable, for example, when
it is required to predict the sign of the increment. In some problems, it is quite
reasonable to use as a criterion some values averaged over large past samples,
106 S. Frenkel

in others, local estimates are needed. There are prediction algorithms (predictors)
that use numerical (measured) data without any assumptions about their statistical
and probabilistic properties (like many algorithms that use the concept of a neural
network [3]), and there are implicitly assuming data as samples from a normal
stationary process, or as transitions of a Hidden Markov Net (i.e. generated by a
Markov source [8]).
In this regard, we fix the following approaches to the problem of prediction
present in the literature:
– as one of the mathematical problems of estimating unknown conditional proba-
bility distributions (“theoretic-probabilistic approach”),
– as one of the mathematical problem of choosing the optimal solution DM
(decision making) approach,
– as an algorithmic control problem.
In principle, the practical application of prediction models requires a combination
of these approaches, which in one form or another must be implemented in practice,
although in the literature these approaches exist for the most part isolated. In my
opinion, this is bad both for practice and for teaching students.
In the Probabilistic approach, the task of prediction is as follows:
Let at the time t, the conditional probabilities γ(xt + 1 |x1 ,x2 ,.,xt ) (or the probabil-
ity distribution density P(xt + 1 |x1 , . . . , xt )) are estimated for the implementation of
the process x1 ,x2 , . . . ,xt . In other words, estimates of how likely it is that at time
t + 1 we will see the value that we predict. It is clear that the more accurate (in the
accepted metric) the estimate, the better the forecast. At the next time t + 1, we have
to estimate the probabilities γ(xt + 2 |x1 . . . xt + 1 ) (or the density p(xt + 2 |x1 , . . . ,
xt + 1 )) and so on. Mathematical model estimates γ(xt + 1 |x1 ,x2 ,.,xt ), as well as the
conditional probability γ() itself, is called a “predictor” in the literature [14].
Along with γ(), predictors are also called software products (for example, some
modules of cloud MS AWS Azure ML, Google Cloud ML, etc.) that are used for
prediction.
In the future, these software products will be called as previously (Introduction),
“Prediction instrument”)PI) which may be a subsystem of intelligent decision
support (IDS), and the word “predictor” will be used in both senses with the
necessary explanations, if this does not directly follow from the context.
In the DM formulation, the following prediction problem is formulated, taking
into account the requirement for the efficiency (accuracy) of the forecast: for a
given space of strategies B, the space of possible predictable values X, and the loss
function l(b, x), b ∈ B, x ∈ X, choose the predicted next value xt = bt given the
knowledge of past states x1 , . . . ,xt − 1 (from X), so that the total the loss from the
prediction error (“loss”) would be asymptotically close to the loss obtained by the
best fixed strategy known a posteriori after looking at the entire sequence x1 , . . .
xt . – i.e. a well-chosen (predicted) bt should minimize possible a posteriori losses.
The strategy is a function that returns an output vector y for each input vector x
according to the conditional distribution function P(y|x) (or γ()) [14, 16].
On Impact of Data Models on Predictability Assessment of Time Series 107

The third statement of the problem, which we call the “algorithmic model” of
prediction, consists in using various conceptual models for obtaining a prediction
that do not explicitly contain any basic elements of mathematical models of the
above two mathematical formulations of the prediction problem. In terms of content,
conceptual models underlie the input language for describing the task of obtaining a
forecast (more details below), in particular, the parameters of called functions from
certain machine learning libraries [9, 17, 50], which can specify the structure of the
predictor used. For example, a search tree in eXtreme Gradient Boosting Regressor
aka XGBRegressor, if the user conceptual model considers the forecasting process
as a search for an acceptable (by the selected performance criterion) prediction
among a set of possible values in a fixed data area, or as a structure of a neural
network (number of layers, excitation functions etc.), for example, using the LSTM
model. XGB, from the point of view of the user of the prediction tool, is a gradient
boosting machine that reads some dataset, applies to it various methods of predicting
(not necessarily using any probabilistic models) the future values of the time series,
using an iterative procedure to adaptively change the samples from the training data,
trying different ways of predicting the results of evaluating misclassified records.
This can be seen as incremental addition of prediction methods (referred to as
“prediction models”) until further improvement is made. These “prediction models”
can be developed based on any of the above two classes of mathematical models,
but they are not present in the description (input language) of the algorithmic model
in any way – from the point of view of XGBRegressor, these are just more or less
successful predictors (oracles).
There are dozens of other algorithmic models that operate from the user’s point
of view as a solution (best prediction) search engine, for example, using Linear
Regression, Neural Networks;. Support Vector Machines (SVMs), Artificial Neural
Networks, etc.

3 Conceptual Model of the Prediction Problem

In software algorithmic prediction models, as in other machine learning systems,

algorithmic models are based on the use of experience gained as a result of
solving problems at the training stage. For example, an automatically trained binary
classifier is usually understood as a model that represents the characteristics of the
classes learned at the training stage [10, 11].
A more or less coherent description of such models can be made on the basis of
the idea of a conceptual model.
The Conceptual Model (CM) is a set of notions (“points of view”) about the
elements, objects (both real and model), and the goals of the modeled system, as
well as the relationships between them, expressed in terms of a particular theory.
From this point of view, the CM of the prediction algorithm should represent data
types and assumptions about their mathematical properties.
108 S. Frenkel

Considering that the object of our interest is probabilistic models of prediction,

the conceptual model (CM) in the paper means a (formalized or natural) description
of the relationships between elements, objects (both real and model) and the goals of
mathematical models and algorithms predictions, as well as the relationship between
them expressed in probabilistic terms.
Note that in the framework of modern software development methodologies
(Software Engineering), a conceptual model is understood as a formalized descrip-
tion of project requirements in one or another data modeling language (for example,
UML) [12], and the corresponding language allows a certain typing and structuring
of these concepts.
However, this is necessary only if there are information-logical relations between
the elements of the system being developed, and if they are not taken into account,
it can lead to conflicts during the functioning of the system. An example is the
requirement to comply with the temporal order of occurrence of certain events [13].
Since the prediction problem as such (without details of its software imple-
mentation) does not require such a description, we will consider textual informal
KMs.
The main requirements for a conceptual and mathematical model are that they
must provide a representation of patterns and trends in the data that are important to
the user, in which case it will be a predictive model for current data to predict what
will happen next. In this case, the model should allow the evaluation of actions.
Steps to be taken to obtain optimal results.
Therefore, CM should represent the upper level of the description of a software
product that performs the prediction of the “future” from known data, which we
call the Predictor. Within the framework of the ML paradigm, we consider that
the predictor is being trained, and then, based on his training, forms a forecast.
Training is on some sequence, and the goal is to tune the parameters of the model
implemented by this program to predict the future part of the (not yet observed)
sequence. In practice, this may consist not only in tuning the parameters of one
particular model, but also in choosing a set, predictor, and/or training samples from
a certain region, and the goal is to predict new sequences from the same region as
accurately as possible.
The usual approach for this scenario is to develop a theoretical model of the
real process (random sequence or process) consisting of identified process steps and
possible state transitions with their transition probabilities.

3.1 Requirement for Consistency of CM Components

In order to have a tool for describing the options for using these approaches to
solving prediction problems, we define a conceptual model of prediction algorithms,
which must be consistent with a mathematical (probabilistic!) model, i.e. provide a
On Impact of Data Models on Predictability Assessment of Time Series 109

representation of all the necessary elements of the subject area of a given prediction
problem in a mathematical model of the predictor and it allows a priori assessment
of its effectiveness (one or another measure of prediction error, for example). An
example (and source) of possible “inconsistency” for the binary sequence predictor
model is the fact that according to [14]:
for any predictor, there is a stationary and ergodic source such that the error
|P(xt + 1 |x1 . . . xt ) – P (xt + 1 |x1 . . . xt )| does not go to 0 when the length of the
observed sequence t goes to infinity (Here P() and P ( ) are the true probabilistic
distribution and its estimate respectively).
This means that for any binary sequence {xt } that models this or that data, for any
predictor of binary data, it may turn out that the predictor chosen for prediction does
not guarantee that the error in estimating the probability distribution of the predicted
value x *t + 1 with respect to the distribution of the true value xt + 1 converges to 0
as the length of the training sequence increases.
In this sense, this predictor is inconsistent with the data model as a random
sequence {xt } (say, Markovian), and to overcome this inconsistency, the description
of the QM must include the concepts of describing the probabilistic model that
explicitly or implicitly underlies the software implementation of the predictor. This
means that, at a minimum, one should always have a set of predictors, in the hope
that among them there will be one for which the specified property does not hold
(the difference between the specified distributions will converge to zero relatively
quickly), and the conceptual model should include a description of the choice
(enumeration) predictors.
Another example of possible inconsistency between data models and predictor
mathematical models is the presence of a non-linear relationship between past and
future values among the data (for example, self-similarity [15]). If this circumstance,
i.e., a possible mathematical model of the data, is not represented in the conceptual
model of the proposed prediction method, then there is an obvious uncertainty in
the choice of an effective algorithm used in the available toolbox.
So, the conceptual model in our understanding should be the top level description
of a software product that performs the prediction of the “future” from known data,
which we call the Predictor. Within the framework of the ML paradigm, we consider
that the predictor is being trained, and then, based on his training, forms a forecast.
Training is on some sequence, and the goal is to tune the parameters of the model
implemented by this program to predict the future part of the (not yet observed)
sequence. In practice, this may consist not only in tuning the parameters of one
particular model, but also in choosing a set, predictor, and/or training samples from
a certain region, and the goal is to predict new sequences from the same region as
accurately as possible.
110 S. Frenkel

3.2 Generalized Representation of the Conceptual Model

of the Prediction Problem

We will consider the predictor p of values of the sequence x1 , . . . ,xt as a function

bt + 1 = fp (xt-m , . . . ,xt ), which calculates the value of the random sequence
predicted by the predictor fp at time t, m is the number of terms of the sequence
preceding by the time t, which is used in this predictor p to obtain a prediction
(bt + 1 means exactly the estimate of the true value of xt + 1 and may not coincide
with it).
A sequence of random data xt-m , . . . ,xt (or a segment of the implementation of a
continuous process) can be considered either as a training sample (its part) and / or
as input data for some regression model (for example, ARIMA), i.e. as a regression
xt + 1 over xt-M , . . . , xt , where M ≤ m, and for M < m values between xt-m and xt-M
are used for the training.
Functions fp belong to one or another family of algorithms (for example, neural
networks, gradient, etc.).
We will call the prediction is “probabilistic” if the values at the next moment
of time are predicted only with a certain probability (Note that the prediction
algorithm, i.e. the method of calculating from known (observed) values) can be
either deterministic or probabilistic).
With that said, the conceptual model is presented as:

CM = Mθ D (S), ML , FLM (1)

where Mθ D (S) is a prediction model defined by the data type D (binary, real), with
a probability measure θ on D, and a structure S on D, which refers to the way data
of type D is structured as predictor input variables (for example, splitting data for
training and test choices, or scalar or vector data, etc. [9, 11, 50]),
ML is a model of loss (“PENALTY”) from the received prediction with loss
function L(XS ,YS ), where XS , YS ∈ D observed (XS ) and YS predicted data with
structure S given by some ratio on D × D (e.g., “success rate”, the ratio of the
number of successful predictions to the total number of predictions), related with a
measure of predictability [8, 14, 21] and prediction efficiency,
FLM = {f1 ,..fm } is a set of predictors based on the models Mθ D (S) and ML –
consisting of predictors in the sense defined above, i.e. methods for estimating the
future value with the conditional probability distribution of the predicted value.
We assume that the predictor fp always corresponds to an admissible prediction
model, i.e. all its input data and parameters can be uniquely determined within
the framework of the model under consideration at the moment, and therefore are
consistent in the sense indicated above (as it was introduced in the Sect. 3.1 for the
concept of CM).
On Impact of Data Models on Predictability Assessment of Time Series 111

This construction (CM) allows representing classes of models according to the

types of predicted data, used probability measures, loss functions when making
decisions about the acceptability of the received forecast.
Example 1
Let θ be the probability Bernoulli measure on D = {0,1}n – the set of sequences of
length n = 1,2 . . . – the number that determines the length of the sequences under
consideration. From the point of view (2), n is a parameter of the structure S, for the
sequence Xt = {x1 , x2 „ xi , . . . , xt-1 , xt| }∈D. Probability P (xi = 1) = θ.
For different structures, a model with a given probability measure will have its
own peculiarity (see reasoning about [19] in the 4.1).
Let the loss function is “0/1loss”, that is a measure of the loss of the predictor
user from prediction errors of the form “zero is predicted instead of the true value”
and vice versa, equal to the proportion of incorrect predictions for a sequence of
length n, those:

en = Ep i=1,n I(bt = xt )/n ,

where I() – indicator function, bt , xt – predicted and true value, respectively, Ep

means distribution averaging with parameter .
One can consider the following predictor bt = f(xt-1 ), where f ∈ FLM and the loss
function is defined, known [8] as “the optimal rule for predicting the criterion of
minimum loss in measure en ”:

bt = 1 if Prob(1) > 1/2 (2)

bt = 0 if Prob(1) < 1/2

and in the case, Prob(0) = Prob(1) = 1/2, the prediction is not performed.
In this case, as it is easy to see, the average losses are equal to the error probability
1-, when > 1/2, and , if 0 is considered “success”, which can be expressed as:

Lp = min(, 1 − ) (3)

Instead of a single predictor, we can consider the actions of some subset of FML
predictors.
In this case, the conceptual model for binary data (D = {1,0}n ) can generate the
well-known “multi-expert” prediction scheme, in which binary prediction is viewed
as a game between the predictor and the environment, and predictors (whose goal is
to minimize expertise loss) that fulfills the prediction.
Each expert F is a sequence of functions Ft : {0,1}t-1 ➔ {0,1}, t ≥ 1, i.e. expertise
is a way of setting a probability distribution on a set of sequences {0,1}t-1 . Each
expert defines a forecast strategy as follows: when observing with the first t-1 bits
y1 ,. yt-1 expert F predicts that the next bit of y is 1 with probability Ft (yt-1 ) [16].
112 S. Frenkel

It is easy to see that the content of the conceptual model of this approach does
not differ from the previously considered view of prediction as an estimate of the
conditional distributions of certain events associated with a random binary sequence
(with possible differences in the prediction efficiency estimates used).
Let us now consider what requirements should be met by mathematical models of
data in the specific prediction problems, for example, traffic in telecommunication
networks, so that it is possible to ensure and control the prediction efficiency
in a given cycle (at a given step) of prediction, i.e. to ensure consistency with
the properties of the input data and the criterion making a decision about the
effectiveness of the considered algorithmic prediction model.

4 Requirements for Mathematical Models of Prediction

As it is easy to see from (1), the first issue that should be taken into account
when choosing/developing mathematical models of predictors is the probabilistic
measure used in a particular problem, and hence the probabilistic distribution model
considered on the data set D, and the measure of prediction accuracy determined by
the measure ML (2).
An obvious criterion for forecast accuracy is the distribution of the probability of
prediction error (which determines the risk of incorrect prediction), which obviously
depends on the probabilistic data model, and, formally, its assessment should be
based on knowledge of the probability distributions of the data, and models of
the relationship between the distribution of data and the probability of the correct
forecast. But usually these distributions are unknown.
From a formal point of view, the unknown distribution of the predicted data
means impossibility of a priori estimation of the accuracy (uncertainty) of the
prediction with a given prediction algorithm leads either to the requirements for its
assessment in the prediction process, or to the use of Bayesian methods [17] of the
distribution of the desired conditional probability p(y|x) predicting x from known
data y, which is also associated with technical difficulties.
This indicates the need for a certain theoretical model that allows traffic
prediction problems to be based not on specific knowledge of distributions, but on
knowledge. About general properties of distributions of certain class. However, the
ability to overcome this problem depends on the data type D.

4.1 About Binary Data Prediction

This class of problems can be of interest both in itself and in solving problems of
predicting the sign of increments of discrete or continuous random processes (see
Sect. 5).
On Impact of Data Models on Predictability Assessment of Time Series 113

So, for discrete data from some finite alphabet A, one of the theoretical
approaches to overcome the problem of the lack of exact knowledge of the data
distribution is to use the concept of an individual sequence, which is considered as
existing in a single instance, and not as a sample trajectory from the ensemble, as
is customary in theory. Random processes. It is believed that it is generated by a
random source, usually a stationary ergodic one.
A natural question is: can one predict the next values of an individual sequence
by estimating a probability distribution based on the past, for example, a specified
proportion of correct predictions up to time t, and then minimizing the expected
loss in this assignment (i.e., acting as if future events actually happened with
an estimated probability). The answer is yes, if you use the so-called. “universal
scheme” predictor [14].
A well-known and used example of a universal measure for symbolic
(binary, in particular) sequences, which expresses the conditional probability
Prob(xt + 1 |x1 , . . . xt ) and is built on the basis of a universal code U, the redundancy
of which is asymptotically minimal for classes of Bernoulli and Markov sources
[14] (the most famous example is the code obtained as a result of applying the
Lempel-Ziv (LZ) compression algorithms [14]).
According to the well-known result [14], let U be a universal code for some set
of sources generating letters from the alphabet A, and the measure μU for each
word v in the alphabet A is given by the equality.
As shown in [14], under the assumption of equiprobable generation of binary
strings from the set U by some (hypothetical) data source, the conditional probabil-
ity Prob(xt + 1 |x1 , . . . xt ) can be expressed:

μU (v) = 2−|v| / 2−|U(u)|
u∈Av

where v is the considered string x1 , . . . ,xt from a given set of binary strings in
a given universal code (e.g., LZ), which provides compression as much as their
entropy allows, Av denotes the number of strings A|v| .
Then the measure μU is universal on , i.e., predicting the next value of any
sequence from the set given by sources from , about which there is reason to
assume that this is a Markov sequence, can be predicted with some minimum error
without knowledge the exact distribution parameters, as:

Prob (xt+1 |x1 , . . . xt ) = μU (x1 , . . . , xt−1 , xt ) /μU (x1 , . . . , xt−1 ) .

Note that the above LZ algorithm [14], which performs lossless character sequence
compression, can at the same time act as a predictor by determining xt + 1 as the
corresponding leaf in the partial match tree [14] with the conditional probability
induced by the stepwise algorithm parsing (Recall that one of the well-known
criteria for the randomness of a binary sequence is the possibility of its compression
(to a value corresponding to the unit of entropy per symbol, which is the randomness
criterion [18])).
114 S. Frenkel

Let us now clarify the issue of prediction accuracy.

Since the predictor is considered as a conditional distribution, the question of
using a universal predictor is reduced to estimating the closeness of the estimated
distribution P(xt + 1 = a|x1 ,..xt ) and the conditional distribution γ known at time t
(by the predictor) at time t + 1. It is known [14] that an effective measure of the
closeness of two distributions is the Kullback-Leibler measure:

KL Pγ =a∈A P (xt+1 = a|x1 , x2 , ., xt ) log (P (xt+1 = a|x1 , x2 , ., xt ) /

γ (xt+1 |x1 , x2 , ., xt )) .

where KL is the divergence measure of the Kullback-Leibler (KL) distributions.

A more practical and simpler example of a universal predictor is the Laplace
Predictor L(xt + 1 = a| x1 . . . xt ).
It estimates the conditional probabilities P(xt + 1 = a ∈ A| x1 , x2 . . . , xt ) from
the known values of x1 , . . . , xt like:

γ (xt = a|x1 , . . . xt−1 ) = nx1,...xt−1 + 1 / (|t| + |A|) .

where nx1, . . . xt-1 is the number of occurrences of the letter a in the subsequences
x1 , . . . xt-1.
If A = {0,1}:

γ (xt = 1|x1 , . . . xt−1 ) = (n1 + 1) / (M + 2) ,

γ (xt = 0|x1 , . . . xt−1 ) = (n0 + 1) / (M + 2) .

n1 , n0 are the number of ones and zeros in the sample of the size |t| = M.
It is important to note that the predicted probabilities cannot be equal to zero
even through a certain letter did not occur in the word x1, . . . , xt − 1, xt .
For example, for the sequence 01010, the Laplace predictor:

γ (1|01010) = 3/7 for A = {0, 1} .

It was shown in [9] that for any source with probabilistic distribution P that generates
independent and identically distributed symbols from the alphabet A, Laplace
predictor error satisfies the inequality:

KL (P, L) ≤ log (e (|A| − 1) / (t + 1)) .

(e = 2.718 . . . is the Euler number) .

On Impact of Data Models on Predictability Assessment of Time Series 115

As it is not hard to see, Laplace predictor is universal as its considers prediction as

a set of estimations of unknown (conditional) probabilities, and the average error
of the Laplace predictor (estimated either by the KL divergence or the variation
distance) goes to zero for any unknown i.i.d. source, when the sample size t grows.
Moreover, it can be easily shown that the error (and the corresponding variation
distance) goes to zero with probability 1, when t goes to infinity. Obviously, such
a property is very desirable for any predictor and for larger classes of sources, like
Markov, stationary and ergodic, etc.
We see that the error of the Laplace predictor tends to zero for any source that
generates independent and equally distributed symbols (i.e. does not depend on the
probability distributions of symbols (the probability of occurrence of symbols), but
requires certain (theoretically) probabilistically-conditioned constraints).
Unfortunately, there is no predictor that has this property for any stationary
ergodic source [6, 8]). But the universal predictor γ, for any stationary ergodic
source ω generating letters from some finite alphabet A, ensures that another
measure of error tends to zero, namely, the Cesàro mean of errors:

limt→∞ t=1,s KL(ω, γ)t /t = 0.

In other words, we can talk about the universality according to Cesare with such a
measure of error, the predictor γ is universal for the set of sources if for it this
expression = 0 for any ω ∈ . That is, if we know to which of the wide class of
(stationary ergodic) sources (generating data) can belong data, then we can, given
this equality to conclude that for any source ω ∈ , the value of the predictor γ(x1 ,
. . . ,xt ) approaches the probability ω(x1 . . . xt ), which means that the predictor is
universal (in the sense of averaging over the divergence KL on the interval of length
t).
From this equality, we can conclude that, in a certain sense, the universal measure
μ is a nonparametric estimate for an unknown conditional probability distribution
P.
Thus, if the KL-Cesaro estimate is relevant to the prediction problem under
consideration (for example, there is a monotonic relationship between γ() and some
predictive quality criterion, for example, in terms of the loss function), then θ in
the representation {Mθ D (S), ML , FLM }, we can consider as an arbitrary stationary
ergodic measure.
As it shown in [14], whatever the actual data generation mechanism, using a
universal approach (more precisely, estimating the quality of a prediction based on
it) does not perform much worse than any other possible forecasting method that
uses knowledge of the probability distributions of data.
Thus, using the results of the universal predictor theory, one can abstract from
the parameters of specific distributions in the course of prediction.
To better understand the implications of this result, let’s ask, does additional
information about the probabilistic properties of the data always improve the
prediction score?
116 S. Frenkel

A negative answer follows, for example, from the results of [19]. In [19]
was proved that for any finite i.i.d. a sequence of binary data in which each
outcome “success “or failure” (0 or 1) the conditional expectation of the proportion
of successes among the results that immediately follow a series of consecutive
successes will be strictly less than the corresponding conditional probability of
success (from which one determines expectation – depends on the occurrence of
at least one series of k consecutive successes within the first n – 1 trials, where n > 3
and 1 < k < n – 1).
Thus, the choice of the structure S, so that it includes k previous values of the
binary sequence, in this case, can lead to a worse prediction. It all depends on the
prediction model.
Correspondingly, attempts to use knowledge about the probabilistic properties of
binary sequences do not always lead to better forecasts, and the resulting forecast
depends on the accepted criterion (“optimal Bernoulli”, in the considered case (2)).
This suggests that additional information does not always lead to an increase
in the a priori probability of a correct forecast; here we are talking about a priori
probability, because explicitly a posteriori probability when choosing a prediction
method (or predictor) is not present – it can only be estimated indirectly, based on
the results of past work.
Let us consider how criteria are manipulated in the modern practice of MO-based
predictions to ensure independence from the knowledge of distributions.

4.2 Performance Criterion Not Related to Data Distributions

As the analysis of the literature shows, another way to avoid the need to evaluate
the exact laws of distributions is to use the criterion of empirical risk minimization
(ERM) [20, 51].
The main idea behind ERM is that we cannot know the true “risk” when running
on a particular dataset because we don’t know the true distribution of the data the
algorithm will operate on, but instead we can measure its performance on an already
known dataset training data (“empirical” risk), and minimize it. Cloud computing
traffic prediction papers [20] show that the principle of risk minimization used by
time series prediction algorithms affects the accuracy of algorithms in different ways
in environments with different traffic models. ERM is rated as:

bt = argh∈H min Remp (h)

Remp (h) = i=1,n L h (x) , x /n

where L(h(x),x’) is a loss function that measures how much the prediction value
x’ = bt under hypothesis h∈H differs from the true value x = xt .
On Impact of Data Models on Predictability Assessment of Time Series 117

In other words, Remp is an estimate of the average loss calculated from n previous
observations. As shown in [21] the proportion of correct predictions averaged over
the number of observations is the estimate of the predictor γ(), and (as it is easy to
show) this estimate coincides with Remp for the specified “0/1loss” with Hamming-
type loss function L(a, b):

L (a, b) = 1, if a = b, and L (a, b) = 0 in the case of a = b,

where a, b are true and predicted values respectively.

However, no assumptions are made about the distribution of the predicted data.
So, for symbolic/binary random sequences that could be considered as stationary
ergodic, there are no characteristics other than ordinary frequencies (Laplace,
Bernoulli optimal predictor (2)) that significantly affect the prediction efficiency.

4.3 Influence of Probabilistic Properties of Random Time

Series on Prediction Efficiency

The concept of probabilistic causality, introduced in the context of random processes

homogeneous in time, can be used to determine the similarity relation on stochastic
processes.

Examples of Concepts in the Theory of Random Processes with Real Values

Affecting the Efficiency of Prediction

Let we have two random processes X(t) and Y(t) defined on an ordered set real
domain R with associated probability functions P and Q on the same result set. We
say that the two processes are causally similar [52] if:

P (x (t) |x(t − a)) = Q (x (t) |x(t − a)) , ∀t and ∀a > 0.

where “|” as previously means conditional probability.

It is obvious that all processes that are homogeneous in time and transferred in
time are causally determined.
Intuitively, this is true for a closed physical system, but not for an open system,
since in this second case other external variables may influence dynamic evolution.
If two processes are homogeneous in time, Markovian and jointly of the same
transition matrix, it is also easy to show that they are causally similar.
Finally, if two Markov processes (not necessarily time-homogeneous) share the
same transition matrix at the same time step, then they are also causally similar.
From the point of view of the prediction problem, this means that the conditional
distribution of the predictor γ () for a particular data set can be determined by
different mathematical models, which will be considered below.
118 S. Frenkel

Other important concepts in prediction theory are dependency models of past

and future values of data (linear or non-linear models, correlation structures) and
stationarity of data in the sense that some (usually hypothetical) laws governing the
change in data over time of observation (and associated change in events) remain
unchanged, at least during the collection of the history, on which the forecast must
be made.
For example, for highly correlated time series (including the case of binary
sequences), linear regression models (ARIMA, such as [3]) give a better prediction
than if they are weakly correlated.
In other words, if, when considering the model properties of the time series,
we restrict ourselves only to the correlation properties of the sequence, and
the conditional forecast probability γ(xt + 1 |x1 ,x2 ,.,xt ) considered above is clearly
related to correlation (as is the case, for example, for Gaussian distributions,
when correlation is equivalent to independence), then according to the degree of
correlation we can expect one or another quality of prediction by a linear regression
predictor.
However, as a study of the literature has shown, many examples of data, for
example, telecommunication network traffic, considered as a random sequence
(time series), in general, have more complex and subtle properties from the point of
view of the theory of random processes, and, accordingly, they have more complex
of prediction models.
For example, the traffic values at the predicted moment may depend on events
significantly remote in time (LRD – Long-range dependence, see below) and it is
natural to assume that LRD traffic prediction will be quite effective (since informa-
tion about the past clearly affects the predicted future). Moreover, the trajectories
(realizations) random processes with such properties have “self-similarity”, i.e.
repeatability of patterns on different time scales [23].

4.4 Self-Similarity as a Nonlinear Relationship Between

the Past and the Future

In fact, the self-similarity property reflects the nonlinear relationship between the
past and the future.
The fact is that, for example, the self-similar behavior of traffic, and its long-term
correlation, although used as the basis for forecasting] [, but on the condition that
the predictor should not respond to short-term traffic changes. However, there are
situations where this short-term data is important, for example, at the beginning of
a DDoS attack [24].
At the same time, self-similarity means a nonlinear relationship between the
present and the future, since the repeatability of a form on different scales can
not be expressed by a linear transformation, since the transformation in time is
associated with the appearance of new harmonics in the spectral representation of
corresponding random process.
On Impact of Data Models on Predictability Assessment of Time Series 119

If in the discrete spectrum of a random process (time series) x(t) there is only
one spectral frequency harmonic ω, and at some future moment x(t + k) 2ω appears
(a “faster” component), then this can only be expressed with using non-linear
transformation of values x(t + k − m), . . . x (t + k − 1), m < k. This circumstance
can also make it inefficient to use a very large volume of observations.
From a more general point of view, self-similarity is a term from fractal theory
[25], which describes objects that visually look the same regardless of scale, which
is expressed in the fact that local signal patterns are repeated many times in a
whole time series on different (usually small) time scales, so the original set can
be reproduced from its smaller portion at suitable magnification.(Intuitively, this
property means the presence of a past-future connection structure that, from an
intuitive point of view, could increase the possibility of correct prediction). For
example, in a number of methods [24], self-similarity destruction is a predictive
feature – the methods are based on the observation that the presence of a DDoS
attack reduces the degree of self-similarity of normal traffic, since DDoS tools do
not generate self-similar traffic, and this is reflected in the traffic.
In the practice of telecommunications networks, self-similarity can be caused
by so-called “Elephant” connections [27], generating a continuous stream with
an extremely large total size of bytes created by, for example, a TCP stream or
other network channel protocols. These streams can use extremal share of the total
bandwidth over a period of time. Thus, it is they who will determine the state of
traffic during this period of time (say, the volume of transmitted packets per second,
or packet delay), and it is this period that affects the structure of traffic in this time.
Note that self-similarity is also called scale invariant under the transformation
x = bx, y = ay, if a curve F(x) is scale invariant under the transformation [28, 45]:

F (bx) = aF (x) ≡ bH F (x) .

where the exponent H = log(a)/log(b) is called the Hurst exponent (also known as
Hurst parameter or simply H). The Hurst exponent H determines the time separating
correlated (for example, stronger than a certain threshold value) of random process
(e.g., traffic) samples from each other (more details below).
Hurst exponent:

E (R (n) /S (n)) = CnH , n → ∞.

R(n) is the range of the first n cumulative deviations from the mean S(n) is the series
(sum) of the first n standard deviations E(n) is the expected value n is the time span
of the observation (number of data points in a time series), C is a constant.
Practically, in nature, there is no limit to time, and thus H is non-deterministic
as it may only be estimated based on the observed data. The calculation of H is
considered in detail in [28].
This value shows the degree of presence of self-similarity in the time series, since
its value is the greater, the smaller the time shift between the same pairs of values in
this time series.
120 S. Frenkel

(The greater the delay between two identical pairs of values in the time series,
the smaller the Hurst coefficient.)
The presence of self-similarity properties raises the question of the possibility of
using it to increase the efficiency of predictors.
Therefore, let us briefly consider well-known approaches to mathematical mod-
els of time series with self-similarity.
1. Poisson models
Speaking about the possibility of using the Poisson flow model as a means of
modeling self-similarity, we note that in [29] it is shown that, for example, modeling
network traffic with the Poisson model, assuming that the packet length will tend
to smooth out by averaging over a long time scale, may be incorrect, precisely
because network traffic exhibits a long-term dependence. At the same time, uniform
sampling from heavy-tailed distributions can produce poor estimates, since a
relatively small number of samples can seriously affect the final estimates.
The slow decay of the variance of a random process (e.g., in the number of
arriving packets) as the scale of self-similar traffic increases is in stark contrast to the
mathematical structure provided by the Poisson simulation, in which the variance
of the arrival process decays as the square root of the scale measure.
Therefore, less traditional models and properties of random process consider for
the series with self-similarity.
2. Long Range Distance model
If a random process, for example, network traffic, can have the property of a long-
range (extended in time) dependence [23], in which the correlation is preserved
between events sufficiently remote in time corresponding to changes, then one
speaks of a Long Range Distance (LRD) model.
From the point of view of correlations, the time series Xt , t ∈ Z, is called long-
range dependent if its covariance function [25, 26, 27, 28]:

γ (t) = E (X0 − EX0 ) (Xt − EXt ) ∼ c | t | 2−2H , t → ∞,

H ∈[1/2,1] – the Hurst exponent.

In spectral terms, the presence of LRD properties can be represented as:

f(O)a c|O|2H-1 при OÆ0

где f(O) is the spectral density Xt,

c > 0 is a constant.

The larger H, the stronger the time dependence, because the covariance function
decays more slowly at infinity (i.e., the correlation of events separated in time is
On Impact of Data Models on Predictability Assessment of Time Series 121

preserved at large H), i.e. the decay of the values of the autocorrelation function
is much slower than the exponential convergence typical of (short-range) classical
models such as autoregressive moving averages (i.e. if the data is represented by a
process with autocorrelation properties).
In the case of processes without LRD, there is no dependence (H = 0.5) and its
autocorrelation rk = 0 for the lags k ≥ 1.
The time series describing the traffic becomes self-similar due to the simultane-
ous influence of many sources (for example, as in the above [24]), and the aggregate
multi-level source traffic (throughput traffic) with a distribution of latency with
heavy tails for the time interval in which the source is active or inactive, can be
approximated by Fractional Brownian Motion (FBM) as shown in [25, 27, 45].
Therefore, FBM is considered as a natural tool for modeling the phenomenon of
self-similarity.
3. Fractional Brownian motion model
Fractional Brownian Motion (FBM) with the Hurst parameter H is a continuous
Gaussian process with an autocorrelation function. At the same time, for H > ½,
which, formally, means the LRD-property. But in contrast to the classical Brownian
motion, the increments FBM (of the difference process) do not have to be indepen-
dent, and knowledge of the law of this dependence (for example, the conditional
distributions indicated above) in some cases could improve the prediction (as
prediction of a process with known prediction) with continuous time on [0, T] that
has zero (conditional!) expectation for all t in [0, T] (see the possible use of this
property for prediction in Sect. 5 of this paper).
Note that in terms of using process properties for prediction, a Brownian Motion
(BM) without “fractionality” is a motion in which the process value changes with
random increments over time, and the prediction is determined by this randomness.
At the same in this process there is a certain “memory”, which means the
dependence between the past and the future.
From a formal point of view, BM is the integral of white noise. These motions
define paths that are random but (statistically) self-similar, i.e. the approximate
trajectory (section) of the particle’s motion (“outlining” the implementation of a
random process) resembles the entire path, and this is an intuitively understandable
possibility of prediction. But this requires a model that contains one or another
description of the structure of fractals, and not just a probabilistic model of
increments.
Section 5 will show how the measurable properties of the autocorrelation
function can be used to make a forecast.
So, the Hurst exponent can clearly indicate whether the time series has the
property of a pure random walk, or has some correlation structure [22, 25, 29]. For
example, in the field of Internet of Things [31] the prediction model may include
a parameter for network traffic as a priori knowledge. Since the self-similarity
property is well interpreted in terms of describing traffic by IT specialists, its use
in conjunction with the features of the deep network increases the interpretability
122 S. Frenkel

of the model, namely, the ability to understand how data properties, in particular
self-similarity, affect the quality of the forecast.

4.5 Stationarity as a Property of a Random Process Affecting

Predictability

Let us now consider another important property of the data modeled by a random
process, namely, the stationarity.
It is intuitively clear that the stationarity of a random process in the narrow
sense is more favorable for forecasting than non-stationarity, if only because the
joint probability distribution W(xt + 1, x1 , . . . xt ), which determines the conditional
probability of the predictor γ(xt + 1| x1 , . . . xt ), does not depend on time.
Most of the considered real processes, however, are not strictly speaking
stationary.
For example, this in most cases concerns network traffic, where non-stationarity
can be associated with its hopping over a wide range of time scales, and, many
publications show [25, 27] that network traffic does not in principle have stationary
behavior, also due to the presence of time cycles (daily, weekly) and can be
easily affected by network reconfigurations (for example, manual reconfiguration,
dynamic routing changes), communication failures and the deployment of new
machines and applications.
Obviously, for similar reasons, time series describing other real processes, such
as electricity consumption, etc., can also behave.
In other words, one can speak confidently about stationarity only to a certain
extent.
To estimate this “confidence degree”, statistical tests for stationarity are well
known [32], however, it is difficult to include their results in the prediction model,
since apart from intuitive qualitative arguments that stationary processes are better
predicted than non-stationary ones, nothing can be used in the model.
In [31], predictors are compared based on short-term correlations, which is
typical for problems of estimation, classification, and prediction of non-stationary
processes, and it is investigated whether it is useful to include a long-range depen-
dence in the prediction model, which, as noted, is associated with the self-similarity
property. The conclusion is that, first of all, short-term correlations dominate the
contribution to predictor performance, i.e. time to calculate satisfactory, in terms of
the applied criterion, predicted values. As a consequence, linear prediction with a
relatively short correlation structure is sufficient for prediction applications, rather
than the long-term correlations that exist between the future value of the time series
(e.g., traffic level and remembered history)). This is the case, for example, when
there are very many short connections, sometimes called ‘mice’ flows [27].
In [28] it is shown that relatively short observations can be considered as
stationary, at least in variance, because under self-similarity, the variance of the
On Impact of Data Models on Predictability Assessment of Time Series 123

sample mean decreases more slowly than the reciprocal of the sample X(m) size m
(slowly decay variances) Var(X(m) ) ~ a2 m-β with () < β < 1, a2 is a positive constant.
It is also significant that autocorrelations decay as hyperbolic functions, and not
exponentially fast.
Accordingly, with a certain degree of conventionality, one can also speak of
stationarity in terms of autocorrelation functions.
Hurst Exponent as a Measure of Stationarity In [31, 33, 34] it is shown that the
Hurst exponent H can act as a stationarity criterion for the self-similar process.
Let us point out its connection with the possible conventional characteristics of
mathematical statistics and the theory of random processes, and, accordingly, with
predictability.
Values H > 1 indicate non-stationarity.
For a stationary self-similar process H∈ (0.5.1). The closer the value of the Hurst
parameter is to 1, the slower the dispersion decays as the time scale increases, and
the traffic is said to become more pulsating, and therefore non-stationary.
Let us consider how these properties of LRD (self-similar) series affect the
technique for solving various prediction problems. If the task is to predict the
following values according to the root-mean-square criterion, and there is reason
to consider the series stationary according to H), then it makes sense to consider
(compare with) the Kolmogorov-Wiener optimal prediction criterion.
First of all, it is known that the root-mean-square error of linear prediction for
the simplest known linear forecast for a stationary time series x(t) is [35]:

ε = 1 − ρ2 var(x) ,

where the correlation coefficient is ρ = ACF(1),

ACF(1) means the autocorrelation function in the lag 1,
where the error corresponds to the minimum mean square error (Mean Square
Error-MSE):

σ2 = E(x (t + m) − Xe )2

where Xe is the predicted value, for example, by the Kolmogorov linear predictor.

Xe = i=1,n ai x (t − i) ,

coefficients ai are the objects of the Kolmogorov-Wiener optimal linear prediction

task.
Therefore, given that for the LRD process the autocorrelation function [22, 45]:

ACH (k) = 1/2 (k + 1)2H − 2k2H + (k − 1)2H
124 S. Frenkel

where k is the lag number, there is a monotonic relationship between the root-
mean-square error of linear prediction epsilon and the Hurst exponent in the interval
[0.5–1].
This analysis can be useful when using predictive software tools such as the
ARMA moving average autoregressive model, the AR autoregressive model, the
moving average MA model, and the ARIMA autoregressive integrated moving
average; see, for example in [36].
In the case of H > 1, prediction methods for stationary series become inefficient,
unexpected bursts with a magnitude much larger than expected from traditional
models, which is consistent with non-stationarity.
It is important, however, that when the initial process B is nonstationary, for the
increment, the fractional Brownian motion has stationary and dependent increments.
The last expression shows that the increments are positively correlated if H∈(1/2,
1), uncorrelated if H = 1/2, and negatively correlated if H ∈ (0, 1/2).
This property turns out to be important for predicting the sign of increments, and
will be discussed in Sect. 5.
Note, that most statistical forecasting methods are based on the assumption that
time series can be made approximately stationary (i.e. “stationary”) by going over
the difference in data over time, so that instead of directly considering index, we
calculate the difference between successive time steps.
However, the ability to predict data with a time difference, rather than the data
directly, is a much more significant indicator of the model’s predictive power.
Indeed, the prediction of a pure random walk is impossible in principle, but
success rate SR > 0.5 may appear simply due to a slight change in neighboring
values and an accuracy criterion that is insensitive to these changes (in a large
percentage of observed cases).
For a stationary process, the MSE linear prediction theory based on the
Kolmogorov-Wiener model assumes that the time series has a finite mean and
variance.
However, for time series with LRD properties, this cannot always be done. The
fact is that many real data streams formed as a result of overlaying data from several
sources have self-similarity and LRD-property, and at the same time are distributed
according to Pareto [24].

4.6 Influence of Probability Distributions of Processes

with LRD on Prediction

For systems with H ≥ 0.5, a Gaussian probability distribution function cannot be

used as a characteristic pdf. One should therefore look for an alternative distribution
function to characterize these systems. Numerous studies of LRD series have shown
that their distributions are close to Pareto [37]:

PPar (X) = aba /xa+1

On Impact of Data Models on Predictability Assessment of Time Series 125

where x ≥ a. The mean and variance of xt that follows are, respectively, given by:

μPar (x) = ab/ (a−1)

VarPar (x) = ab2 /(a–1)2 (a–2)

x ≥ a > 0, b > 0

It can be easily seen that μPar and VarPar do not exist if a = 1, 2.

If it obeys the Pareto law, then the above expression approaches infinity for
a = 2, no matter how large the estimation interval is. Therefore, the use of Wiener-
Kolmogorov predictors to conclude on the predictability of the LRD series relative
to the usual MSE is unacceptable.
Reference [37] shows the relationship between H and the parameters of the
Pareto distribution, which can provide finite values of expectation and variance, and
hence the possibility of linear prediction by the MSE criterion
For an LRD series with finite variance, the covariance slowly decreases to 0 as a
power function. Such time series can be called LRD time series with finite variance.
In this case, the distributions of the LRD values of the series can have heavy
tails [23]. shows that if the heavy-tailed distribution of a time series distribution
with LRD does not allow an acceptable estimate of mean and variance, the
MSE generalization can be used with Kolmogorov-Wiener predictor in a linear
combination of past observed values.
There are examples where a stochastic process shows heavy tails in the domain
of a probability distribution, [38], but the tail parameters can be used to represent
correlation functions of LRD processes with infinite variance.
Let us consider how these properties of LRD (self-similar) traffic affect the
technique for solving various prediction problems. If the task is to predict the next
values by the root-mean-square criterion, then it makes sense to compare with the
Kolmogorov-Wiener criterion as the basis of most linear regression methods.
Note that considering that for many problems of traffic prediction, the prediction
of increments can be a practically important task, the fact of their stationarity can
be effectively used, without searching for the optimal root-mean-square solution
(Sect. 5).
Let us now briefly consider the specifics of using nonlinear prediction models
with regard for time series with the considered properties.

4.7 Nonlinear Models and Neural Algorithms

A non-linear prediction model in modern IPs can be implemented either in non-

linear regression models or in artificial neural networks [39], for example, in
126 S. Frenkel

the deep learning models such as convolutional neural networks (CNNs), Graph
Convolutional Network (GCN) [40] and recurrent neural networks (RNN) [41], in
addition to machine learning algorithms such as Support Vector Regression (SVR)
[41, 42].
However, the neural network can have problems due to overfitting [4], when the
model explains well only the examples from the training set, adapting to the training
examples, instead of learning to classify the examples that did not participate in the
training. It is easy to see that self-similarity can contribute to this phenomenon. At
the same time, such a common way to reduce overfitting as Dropout – turning off
some neurons with a certain probability on some data interval from the training
process may not work due to the fact that training will be similar to the previous
one.
Also, possible sudden changes in cloud traffic can be easily confused by a neural
network with traffic anomalies, which leads to training inefficiency.
These sudden changes can be interpreted as non-stationarity by calculating the
values of H.
Another non-linear time series forecasting technique that is being tried for traffic
forecasting is support vector regression (SVR) [39], which is based on structural
risk minimization. However, the choice of suitable kernel functions and optimal
parameters is very difficult [43]. Examples are briefly discussed in [44].
From the above analysis of the dependence of the efficiency of using prediction
models on specific properties of time series, it follows that the natural way to take
into account the dependence of predictor properties on data behavior (stationarity,
nonlinearity of dependence) is their online selection in the process of solving the
network control problem.
This is confirmed in many publications. For example, in [47] it was concluded
that for predicting the response time and throughput of cloud services, at different
stages of resource use, artificial neural network and linear regression algorithms
have different efficiency. In other words, the overall prediction accuracy can
be improved by combining different prediction algorithms. However, due to the
computational complexity of prediction algorithms, the computational complexity
of choosing search tree prediction tools may be unacceptable for the problems of
using prediction in online network management (for examples when using popular
predictors, see Sect. 5).
It is important, that the data represented by time series is different from other
data models in the sense that it gives us additional information about the time of
occurrence of events that can be used when building a machine learning model. For
example, if the time series is highly correlated in time, then its value at time “t + 1”
is likely to be close to the value at time “t”, and the model actually does, one that
when predicting the value at time “t + 1”, it simply uses the value at time “t” as its
prediction.
However, if we are talking about predicting a change in the sign of the increment,
then this closeness does not give anything significant.
Therefore, let us consider separately the problem of predicting the sign of the
increments of random time series and processes.
On Impact of Data Models on Predictability Assessment of Time Series 127

5 Properties of Probability Distributions as Characteristics

of the Predictability of the Sign of Increments of Random
Time Series and Processes

An example of the usefulness of predicting the sign of traffic increments is, for
example, when attacks on a network are carried out in a way that cannot be detected
by antivirus software, and analysts have to rely on analyzing changes in network
traffic (traffic volume), or the direction of changes in file write intensities [46].
In some cases, when analyzing traffic, trend analysis is used not only because of
the usefulness of this characteristic itself, but also due to the fact that in this case
the forecast is more accurate than for the absolute values of the predicted traffic
characteristics (for example, traffic volume per unit time) [47].
Consider the problem of predicting the trend sign of the time series {xi } by the
incremental sequence t + 1 = xt + 1 – xt . In this case, t + 1 is a centered value,
with a sample mean close to zero for finite segments x. As is known from the theory
of random processes, the correlation between successive values of t + 1 turns out
to be much weaker [22] than in the original sequence {xt }. At first glance, this
may indicate a worse predictability of increments compared to the predictability
of the original xt values. This, however, concerns the values of the increments,
not their sign. Moreover, since the sign of the increment means the direction of
change, the sign of the residual autocorrelation of neighboring values is an important
characteristic, since from a practical point of view, a negative sign of autocorrelation
(“negative correlation”) means a tendency to a multidirectional trend in neighboring
sections of the sequence, which, obviously, can be used to predict the sign of
the change. Further, we will show that these intuitive assumptions have a certain
mathematical justification in the theory of random processes [48].
Definition 1 The change in the sign of the increment of values in ran-
dom data is predictable (and the sign of the increment is predictable) if
E(sign(t + 1 )/Ft ) = E(sign(t + 1 )), where E is the sign of the expectation
(according to the distribution of values, in this case t ), Ft is a part of the
probability space on which the random process is considered, which corresponds
to the observed values xt , or xt-k , . . . ,xt , or, say, some sample estimate of the
conditional mean xt + 1 , etc. (from a formal point of view, Ft corresponds to the
so-called “filtering” in the theory of random processes [30]).
In other words, a sign change is predictable if the probability of its current value
(“−” or “+”) changes when certain previous events Ft change.
In terms of the forecasting approach adopted in mathematical statistics, this
means that we can express, say, the probability of a positive change as:

Pr (t+1 > 0|Ft ) = E (I (t+1 > 0) |Ft ) ,

where I() is the indicator function.

128 S. Frenkel

It is easy to show [48] if the conditional distribution D(t + 1 |Ft ) can be

approximated by the normal distribution N(a, σ2 (t + 1 |t), where:

σ2 (t+1 |Ft ) = E( − E ())2 ,

a = E () ,

= {1, .. t , t + 1 } are increments over the entire area of consideration of the

random sequence X = {xt-k , . . . ,xt, ..},
σ2 (t + 1 |Ft ) = E( − E())2 is conditional variance of increments in time in
which events occur, in particular, over the entire time interval (t − k, . . . ,t) in which
values xt-k , . . . ,xt , then the probability of a positive sign:

Pr (t+1 > 0) = (a, σ (t+1 |Ft )) ,

where is the standard normal distribution function.

It is clear from the above that a zero mean would make the sign unpredictable
(at least for a normal increment distribution) in the sense of the definition given
above, since variations “up” and “down” with any variance about zero mean for a
symmetric distribution occur equiprobably.
For an arbitrary distribution D, it can be obtained that the minimum estimate of
the prediction error relative to the true sign of t + 1 of the assumed “loss function”:
Loss (t + 1, ’t + 1 ) = Et (I(t + 1 > 0) – ’t + 1|t )2 is achieved by estimating:

’t+1|t = E (I (t+1 > 0|t )) = P (t+1 > 0|t ) = 1 − F −at+1|t /σt+1|t

where, we repeat, the conditional expression ( . . . |t) means the calculation of the
observed values up to the moment t inclusive.
Therefore, the dynamics of changes in variations (“volatility”, to use the termi-
nology of financial mathematics) will affect the sign forecast in all cases when the
conditional mean is not equal to zero. Then the sign of the increment is predictable,
even if the conditional mean is unpredictable at zero mean.
If the distribution of F is skewed, then the sign can be predicted even if the
mean is zero: in this case, the time-varying skewness can be a determining factor in
predicting the sign.
But for a non-zero conditional mean (a > 0), even if the distribution is symmetric
about the conditional mean and the conditional mean is constant by assumption
(unlike the t-dependent variation of σ(t + 1|t), the sign of the increment is
predictable by the above definition.
So, conclusions about the predictability of a feature can only be based on
knowledge of the symmetry of the law of distribution of time series, modeled by
a random process with this law, and on the distribution parameters, namely, the
mean, variance, skewness, etc.
On Impact of Data Models on Predictability Assessment of Time Series 129

5.1 Binary Prediction Model of Sign

The considered approach to determining the predictability of the sign of increments

has a certain connection with another sign prediction paradigm based on the
consideration of a binary sequence corresponding to the signs. For example, it is
natural to assume that the value of the new binary sequence is 0 if xt -xt-1 ≤ 0 and
1 in the case of a positive increment, and use this binary sequence to predict the
corresponding statistics of the sign change sequence. At the same time, it is obvious
that this statistic corresponds to the statistic of the change in signs of the increment.
In the most general form, the concept of predictability for binary sequences can
be represented as follows.
Definition 2 A binary sequence x1 ,x2 , . . . from some distribution Z is predictable
for a predictor implementing some polynomial algorithm A if for each 1 < i < n and
any polynomial algorithm for estimating the next value rejects any statistical test for
the inequality [32]:

| Prob(A(x1 , x2 , ..xi−1) ) = xi ) − 1/2 |≤ O(ν(n))

where A(si−1 1 ) is an event consisting in observing a segment of the sequence up

to the moment i which consists in observing a segment of the sequence up to the
moment i, is the predicted value, O(ν(n)) denotes a function decreasing faster than
any polynomial in n.
It is easy to see that in this conceptual model, rule (2) with loss function (3) can
easily be used as a predictor [21].

5.2 On the Connection Between the Properties

of Autocorrelation Functions of Random Processes
and the Probabilities of Changing Signs of Increments

Consider the approach to sign prediction as a nonlinear short-term traffic prediction.

It is shown in [49] that the sign of the difference between neighboring obser-
vations x1 ,x2 , . . . , can be correctly predicted with a probability >1/2 despite the
well-known fact of decorrelation of independent increments (Sect. 2). Consider a
sequence of centered random variables y = {yi = xi – (Exi )}, i = 1, . . . ,t.. with zero
mean and probability density P(y).
If we assume that the increments yi are independent, then the conditional
expectation of its increments yi + 1 − yi at the last observed value of yi is:

E (yi+1 − yi |yi ) = E (yi+1 |yi ) − E (yi |yi ) = −yi

130 S. Frenkel

which follows from the fact that E(yi |yi ) = yi , E(yi + 1 |yi) = 0 with zero expectation,
and all increments {yi } are independent.
Accordingly, considering the conditional mean as a sign predictor, the following
rule is proposed for predicting the sign of the difference yi + 1 − yi :

sign (yi+1 − yi ) = − sign (yi ) (4)

In [49], it is stated that uncorrelated increments are sufficient to fulfill the indicated
sign relation. Although uncorrelated does not always mean independence, for
network traffic this assumption can be accepted. Indeed, in real processes, the
dependence of time-sequential values often takes place due to a non-stationary
trend, which is eliminated by centering yi ≡ xi – Exi , and in addition, most real data,
such as changes in traffic volumes, changes in the number of requested IP addresses,
etc. It somehow can be connected with a random change (decrease-increase) of some
external factors [50], and can be represented by random walk processes close to
processes with independent increments [22, 30].
Regarding the prediction accuracy according to the rule (4) (called in the [44]
as SC (“Sign criterion-based”) predictor), we can give more subtle mathematical
reasoning [42].
The share of successful predictions according to (4) for sufficiently long
sequences is proposed to be estimated as [49]:

1
R= + (1 − F (0)) F (0)
2
where:
x
F (x) = dyP (y)
−∞

y = yi+1 − yi, ;

P() is a probability distribution function (pdf), F(0) obviously represents the

probability of increments being negative and 1-F(0) positive.
It is easy to see that R reaches its maximum at F(0) =1/2, i.e. with a symmetrical
distribution of the increment (difference) of the initial random time series.
These formulas mathematically express, at first glance, a paradoxical result, that
when guessing the sign of the increment of independent random variables from the
previous value of the centered process, you can get the probability of success more
significantly more than ½!
This fact can also be given some elementary probabilistic justification.
Indeed, the truth of (4) depends on a combination of three conditions for the
ratios between yt + 1 , yt (i.e., we consider all possible obvious relations “>0” “0<”
“yt + 1 <> yt ”). In total, 23 = 8 such conditions can be formally written down
(zero values are excluded by measurement practice). Moreover, in 4 cases (4) are
performed correctly, two cases are contradictory (and therefore impossible), and
On Impact of Data Models on Predictability Assessment of Time Series 131

with two combinations there will be a deliberately false solution, namely, in the
cases yt + 1 > 0, yt > 0, yt + 1 > yt , yt + 1 < 0, yt < 0, yt + 1 < yt .
Assuming that all combinations are equally probable, we obtain the a priori
probability of the correct execution of (1) equal to 2/3.
At the same time, there is no reason to believe that two combinations leading to
incorrect predictions will occur on a certain interval several times more often than
those leading to correct ones.

5.3 Joint Use of SC Predictor and 1/0 Predictor

One of the signs of the possibility of using (4) as a predictor is the uncorrelatedness
(or extremely weak correlation) of successive differences in the values of the
considered time series (process) with a tendency to negative correlation of the
values yi + 1 -yi and yi , since the “anti-correlation” of two random variables means a
tendency to change in the opposite direction [22, 49].
The point is that if within a time interval the data is positively correlated, then
changes in a given direction will tend to future changes in the same direction, and
the path will be smoother than the normal Brownian motion process. If the data is
negatively correlated, then there will be a positive change more likely than a negative
one.
The proposed technique of joint use of the continuous and binary sign prediction
models presented here is applied in [44] (called as “SC-0/1 procedure”) and
compared with MLP, XGB, LSTM predictions widely used in modern practice.
Experiments show the high efficiency of the proposed approach.
It is essential that the rule (4) does not require the use of a long sequence of
past observations, which may be inefficient in case of strong non-stationarity and
nonlinearity of the predicted data.

6 Conclusion

As the analysis of the literature shows, many modern prediction tools based on the
principles of MO do not work effectively due to the pronounced nonlinearity of
traffic changes and non-stationarity, the possible inadequacy of assumptions about
the need for a large amount of previous observations. This paper is an attempt
at some ordering and categorization of a huge stream of publications on modern
methods, techniques and models of forecasting data of various nature, which should
to a certain extent simplify the search and analysis of some results of the theory of
random processes, allowing a quick assessment of the predictability of both absolute
data values and signs of their change.
For this, the concept of a conceptual model of algorithms for predicting the
state of systems in the subject area, widely used in modern Big Date and Software
132 S. Frenkel

Engineering, is adapted to represent specific probabilistic models of random

sequences and processes of various nature (symbolic, binary, real). Under such a
conceptual model, the paper refers to a formalized description of the relationships
between elements, objects (both real and model) and goals of mathematical models
and prediction algorithms, as well as the relationships between them, expressed in
probabilistic terms.
Such a construction allows one to represent classes of models according to the
types of predicted data, the probability measures used, and loss functions when
making decisions about the acceptability of the received forecast.
Guided by this CM, such sensitive elements of models as the specific manifesta-
tions of the nonlinearity of past and future relations, the degree of stationarity, the
characteristics of autocorrelation functions, the specificity of distribution laws, and
the criteria for prediction accuracy are singled out.
The proposed procedure for using a preliminary assessment of the indicated
characteristics of probabilistic models makes it possible to assess the presence of
significant non-stationarity in data flows, when it is impossible to consider long
sequences for predictor training. For highly non-stationary time series, the learning
process is associated with enumeration of parameters (for example, neural network
coefficients) if necessary, retraining of models caused by the above non-stationarity,
which can take a lot of time even on the most expensive computing systems (GPU,
clusters).
Among the tasks of forecasting, the task of predicting signs of increments
(direction of change) of the time series process is singled out separately. The
previously proposed prediction procedure [44] implemented as a simple heuristic
rule for predicting the increment of two neighboring values of a random sequence is
considered. The connection of this approach for time series with known approaches
for predicting binary sequences is shown.
At the same time, since the efficiency of prediction algorithms theoretically
depends on the volume of previous observations, which can be unreliable due to
the non-linear and non-stationary nature of the time series (simulating, for example,
the traffic of telecommunication networks, which is what rule (4) corresponds to).

References

1. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
2. Predict Traffic of LTE Network, [Online] https://www.kaggle.com/naebolo/predict-traffic-of-
lte-network. Accessed on 26 Oct 2020
3. Chen, A., Law, J., Aibin, M.: A survey on traffic prediction techniques using artificial
intelligence for communication networks. Telecom. 2(4), 518–535 (2021)
4. Zhang, J., Tan, D., Han, Zhu, H.: From machine learning to deep learning: Progress in machine
intelligence for rational drugv discovery. Drug Discov. Today. 22(11), 1680–1685 (2017)
5. Rooba, R., Vallimayil, V.: Semantic aware future page prediction based on domain. Int J. Pure
Appl. Math. 118(9), 911–919 (2018)
6. Ryabko, B.: Compression-based methods for nonparametric prediction and estimation of some
characteristics of time series. IEEE Trans. on Inf. Theory. 55(9), 4309–4315 (2009)
On Impact of Data Models on Predictability Assessment of Time Series 133

7. Brovelli, M., Sanso, F., Venuti, G.: A discussion on the Wiener–Kolmogorov prediction
principle with easy-to compute and robust variants. J. Geod. 76, 673–683 (2003)
8. Feder, M., Merhav, N.: Universal prediction. IEEE T. Inform. Theory. 44(6), 2124–2147
(1998)
9. Sharma, S..: Activation functions in neural networks (2019). Retrieved from https://
towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
10. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer Science + Business
Media, New York (2009)
11. Fettke, P.: Conceptual Modelling and Artificial Intelligence, Joint Proceedings of Modellierung
Short. Workshop and Tools & Demo Papers Workshop on Models in AI (2020)
12. Introduction to Unified Modeling Language (UML) 3rd INSPIRATION Training, GFA
(December 4–5, 2012)
13. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic real-
time systems. In: In, Proc. 23rd International Conference on Computer Aided Verification
(CAV’11), pp. 585–591. Vol. 6806 of LNCS, Springer (2011)
14. Ryabko, B.: Prediction of random sequences and universal coding. Probl. Inf. Transm. 24(2),
3–14 (1988)
15. Ikharo, A.B., Anyachebelu, K.T., Blamah, N.V., Abanihi, V.: Optimising self-similarity
network traffic for better performance. Int J Sci Technol Res, Int J Sci Technol. Print ISSN:
2395-6011. https://doi.org/10.32628/IJSRST207413164
16. Cesa-Bianchi, N., Lugosi, G.: On prediction of individual sequences. Ann. Stat. 27(6), 1865–
1895 (1999)
17. Yu, P., Kuo, K.-S., Rilee, M.L., Yu, H.: Assessing Deep Neural Networks as Probability
Estimatorsar. Xiv:2111.08239v1 [cs.LG] (2021)
18. Ryabko, B., Monarev, V.: Using information theory approach to randomness testing. J. Stat.
Plan. Inference. 133, 95–110 (2005)
19. Miller, J.B., Sanjurjo, A.: Surprised by the Hot Hand Fallacy? A Truth in the Law of Small
Numbers. arXiv:1902.01265v1 [econ.GN] (2019)
20. Nikravesh, A., Ajila, S.A., Lung, C.-H.: An autonomic prediction suite for cloud resource
provisioning. J. Cloud Comput. Adv. Syst. Appl. 6, 3 (2017). https://doi.org/10.1186/s13677-
017-0073-4
21. Frenkel, S.: Theoretical aspects of a priori on-line assessment of data predictability in applied
tasks 5th International Symposium on Cyber Security Cryptology and Machine Learning
CSCML 2021. LNCS. 12716, 187–195 (2021)
22. Buket Coskun, B., Vardar-Acar, C., Demirtas, H.: A Generalized Correlated,: Random Walk,
Converging to Fractional Brownian Motion. arXiv:1903.05424v3 (2019)
23. Ming, L., Jia-Yue, L.: On the Predictability of Long-Range Dependent Series. Mathematical
Problems in Engineering Volume (2010). https://doi.org/10.1155/2010/397454
24. Brignoli, D.: DDOS detection based on traffic self-similarity (n.d.). https://ir.canterbury.ac.nz/
bitstream/handle/10092/2105/Thesis_fulltext.pdf;sequence=2
25. Graf, S.: Statistically self-similar fractals. Prob. Th. Rel. Fields. 74, 357–392 (1987)
26. Park, R., Hernández-Campos, F., Le, L., Marron, J., Park, J., Pipiras, V., Smith, F., Smith, L.,
Trovero, M., Zhu, Z.: Long-range dependence analysis of internet traffic. J. Appl. Stat. 38(7),
1407–1433 (2011)
27. Megues, P., Molnar, S.: Analysis of Elephant Users in Broadband Network Traffic. 19th
EUNICE Workshop on Advances in Communication Networking (2013). https://doi.org/
10.1007/978-3-642-40552-5_4
28. Leland, W.E., et al.: On the Self-Similar Nature of Ethernet Traffic (Extended Version), pp.
1–15. IEEE Press, Piscataway (1994)
29. Becchi M., From Poisson Processes to Self-Similarity: a Survey of Network Traffic Models.
2008., https://www.cse.wustl.edu/~jain/cse567-06/ftp/traffic_models1/index.html
30. Bosq, D., Nguyen, H.: A Course in Stochastic Processes. Stochastic Models and Statistical
Inference, Kluwer, Dordrecht (1996)
134 S. Frenkel

31. Pan, C., Wang, Y., Shi, H., Shi, J., Cai, R.: Network traffic prediction incorporating prior
knowledge for an intelligent network. Sensors. 22(7), 2674 (2022)
32. Lavasani, A., A., Eghlidos, T.: Practical next bit test for evaluating pseudorandom sequences.
Comput. Sci. Eng. Electric. Eng. 16(1), 19–33 (2009)
33. Park, C., Hernandez, F., Le, L., Marron, J.S., Park, J., Pipiras, V., Smith, F.D., Smith, R.L.,
Trovero, M., Zhu, Z.: Long range dependence analysis of Internet traffic. Journal of Applied
Statistics. 38, 1407–1433 (2004)
34. He, H.: 1 Shitao Cheng, 1 and Xiaofu Zhang, signal nonstationary degree evaluation method
based on moving statistics theory. Shock. Vib. 2021., Article ID 5562110, 18 (2021). https://
doi.org/10.1155/2021/5562110
35. Box, G.E.P., Jenkins, G.M., Reinsel, G.C.: Time Series Analysis: Forecasting and Control.
Wiley, New York (2008)
36. Lyman, J., Edmonson, W., McCullough, C., Rao, M.: The predictability of continuous-time
band limited processes. IEEE Trans. Signal Process. 48(2), 311–316 (2000)
37. Song, W., Duan, S., Chen, D., Zio, E., Yan, W., Cai, F.: Finite iterative forecasting model based
on fractional generalized Pareto motion. Fractal Fract. 6, 471 (2022)
38. Loiseau, P., Gonçalves, P., Dewaele, G., Borgnat, P., Abry, P., Primet, P.: Investigating self-
similarity and heavy-tailed distributions on a large-scale experimental facility. IEEE/ACM
Trans. netw. 18, 1261–1274 (2010)
39. Chen, A., Law, J., Aibin, M.: A survey on traffic prediction techniques using artificial
intelligence for communication networks. Telecom. 2, 517–536 (2021)
40. Vinchoff, C., Chung, N., Gordon, T., Lyford, L., Aibin, M.: Traffic Prediction in optical
networks using graph convolutional generative adversarial networks. In: In Proceedings of the
International Conference on Transparent Optical Networks, pp. 3–6. Bari, Italy (2020)
41. Aibin, M.: Deep Learning for Cloud Resources Allocation: Long-Short Term Memory in
EONs. In Proceedings of the International Conference on Transparent Optical Networks,
Angers, France, 9–13 July 2019; pp. 8–11
42. Yin, F., Wang, J., Guo, C. (eds.): A Boosting-Based Framework for Self-Similar and Non-linear
Internet Traffic Prediction ISNN 2004, pp. 931–936. LNCS 3174 (2004)
43. Shi, Y., Fernando, B., Hartley, R.: Action Anticipation with RBF Kernelized Feature Mapping
RNN. arXiv:1911.07806v3 [cs.CV] 11 Jul (2021)
44. Frenkel, S.: Predicting the direction of changes in the values of time Series for relatively small
training samples. In: 6th International Symposium on Cyber Security Cryptology and Machine
Learning CSCML 2021CSCML, Beer-Sheva, Israel, pp. 118–134. Proceedings, Lecture Notes
in Computer Science (13301) (2022)
45. The Influence of Long-Range Dependence on Traffic Prediction Sven A. M. Östring, H.
Sirisena Published 11 June 2001 Computer Science ICC 2001. IEEE International Conference
on Communications. Conference Record (Cat. No.01CH37240)
46. Nikravesh, A.Y., Ajila, S.A., Lung, C.-H.: An autonomic prediction suite for cloud. J. Cloud
Comput. Adv. Syst. Applic. 6, 3 (2017). https://doi.org/10.1186/s13677-017-0073-4
47. Zhao, A., Liu, Y.: Application of Nonlinear Combination Prediction Model for Network Traf-
fic. 2nd International Conference on Electronic & Mechanical Engineering and Information
Technology (EMEIT-2012). Proceedings, 2337–2340 (2012)
48. Christoffersen, P., Diebold, F.: Financial asset returns, direction-of-change forecasting, and
volatility dynamics. Manag. Sci. 52(8), 1273–1287 (2006)
49. Sornette, D., Andersen, J.: Increments of uncorrelated time series can be predicted with a
universal 75% probability of success. Int. J. Mod. Phys. 11(4), 713–720 (2000)
50. Cloud, B.L., Dalmazo, L., Vilela, M.: Performance analysis of network traffic predictors. J.
Netw Syst Manage. 25, 290–320 (2017)
51. Aryan, M.: Efficient Methods for Large-Scale Empirical Risk Minimization. A Doctoral
Thesis, Philadelphia, PA (2017)
52. Kleeman, R.: Information theory and dynamical system predictability. Entropy. 13, 612–649
(2011)
A Three-Step Method for Audience
Extension in Internet Advertising Using
an Industrial Taxonomy

Dmitry Frolov and Zina Taran

1 Introduction

Modern technologies transform many areas of human life and modify industries.
The era of Big Data brought on remarkable possibilities for extracting meaningful
insight from the data. In particular, the growth of the digital advertising, whereby
an increasing share of advertisement moves from traditional formats (such as
TV, radio, out-door) to the Internet both accelerates the production of raw data
and creates the demand for insight generated from it. Digital advertising brings
new ways to investigate potential audiences, create and control advertisements,
and evaluate results. Companies wish to predict consumer preferences, determine
relevant customer segments for particular products or services, and target marketing
offers based on the data. The amount and diversity of data on one hand, and the
rapidly increasing sophistication of methods on the other allow for such predictions
to be carried out with ever increasing accuracy. For example, they allow marketers
to expand on their time-tested ideas of targeting consumers. Whereas in the past
simple regression methods allowed a company to identify a group of people that
were very much like their existing customers, nowadays it becomes possible to go
beyond that to what is now referred to as an audience extension. In general, ability
to derive better, more accurate insight and to extract more information from the data
becomes essential for any business that wishes to remain competitive. The objective
of this paper is to propose a better method of extracting useful user insight from the

D. Frolov ()
Department of Data Analysis and Artificial Intelligence, HSE University, Moscow, Russia
Z. Taran
Division of Management, Marketing, and Business Administration, Delta State University,
Cleveland, MS, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 135
B. Goldengorin, S. Kuznetsov (eds.), Data Analysis and Optimization, Springer
Optimization and Its Applications 202,
https://doi.org/10.1007/978-3-031-31654-8_8
136 D. Frolov and Z. Taran

data that the organization is likely already collecting without the need to gather new
information.

2 Programmatic Targeting in Internet Advertising

Nowadays, digital advertising approaches allow an advertiser to choose appropriate

user audiences for their campaigns. The most popular approach is programmatic,
a recently emerged technique. Programmatic brings real-time bidding (RTB) to
allow an advertiser to make their purchasing decisions separately for every contact
with every user [6, 17]. A real-time bidding system provides an auction for the
advertisement impression between multiple advertisers, and each advertiser has a
possibility to buy relevant audience only [7, 13, 16]. This technology relies on
assigning users with their interest profiles containing estimates of levels of their
interest in this or that market segment. Consider a conventional, currently much
popular, approach to programmatic selection of targeted audiences [2, 14]. This
approach (CAS) requires to pre-specify a threshold t (usually, .t = 0.3 or .t = 0.4)
and formulate a condition: Given a set of market segments provided by an advertiser
and a user’s profile, check whether the profile has one or more of the advertiser’s
market segments. If yes, the user is selected as part of the audience if at least one of
these segments has its fuzzy weights greater than t, according to the CAS rule.
An issue with CAS is that the number of users satisfying the condition at
threshold t may be less than the number specified in the advertisement order; say,
they want to show their advert to a million users, but only three hundred thousand of
those under observation satisfy the condition. In this case, a conventional strategy
is to have CAS extended (CASE) by lessening t to .t , .t < t, so that more users
satisfy the condition at .t than at t. This may increase the number of users exposed
to the advert indeed, but usually the efficiency lessens, because the added users have
a weaker tendency to be impressed by the advert. A similar diminishing response
rates can be caused by popular the so-called look-alike techniques [8].
To overcome this issue without decreasing the threshold, we propose using
segments profiles ‘generalized’ over an industrial taxonomy rather than the original
ones. This method utilizes a novel algorithm for most adequate generalization in
taxonomies [1] to extend user segments by parsimoniously lifting them over IAB
content taxonomy into a higher rank ‘head subject’. This algorithm was proposed
as an intelligent information retrieval tool [1]. Here it is applied to a very different
task of targeted advertisement.
An approach for user profiling to involve an industrial ontology, to which we
follow, was originated by [9] and further advanced by various authors such as
[2]. The approach involves the following blocks: (a) an industrial ontology in the
format of a rooted tree taxonomy, to provide for marketing user behavior segments,
(b) a system for tracking users’ web surfing histories, and (c) a device to convert
the users’ histories into user profiles. A user profile assigns weights to taxonomy
segments according to user’s interests in them. Advertising system has to detect
A Three-Step Method for Audience Extension over Internet Advertising 137

users belonging to chosen segments and provide advertisement impressions for

them. To apply this approach, we use industrial taxonomy IAB (see https://www.
iab.com/ [3]). As a device converting user surfing histories into user profiles, we use
a random forest classifier based on paper [15].
The paper is organized as follows. Section 3 is devoted to a model and a
method of computation generalization. Section 4 describes our combined method,
Development and Lifting of User Segments Profile (DLUSP), to extend audiences
of targeted advertising. Section 5 describes results of several real-world experiments
comparing a popular conventional approach and DLUSP: our method does increase
the number of successful matches between user segments and campaign segments
2.5-3-fold without losing in the targeting quality. Section 6 draws a conclusion and
lists directions for future works.

3 Parsimoniously Generalization a Fuzzy Thematic Subset in

Taxonomy

Let us consider the main definitions related to generalization in taxonomies,

according to [1], to describe the principles of computational generalization in detail.
Mathematically, a taxonomy is a rooted tree whose nodes are annotated by
taxonomy topics. We consider the following problem. Given a fuzzy set S of
taxonomy leaves, find a node .t (S) of higher rank in the taxonomy, that covers
the set S in a most specific way. Such a “generalization”, or “lifting” problem is
a mathematical explication of the human facility for generalization, that is, “the
process of forming a conceptual form” of a phenomenon represented, in this case,
by a fuzzy leaf subset.
Consider, for the sake of simplicity, a hard set S shown with five black leaf boxes
on a fragment of a tree in Fig. 1. Figure 2 illustrates the situation at which the set
of black boxes is lifted to the root, which is shown by blackening the root box, and
its offspring, too. If we accept that set S may be generalized by the root, this would
lead to a number, four, white boxes to be covered by the root and, thus, in this way,
falling in the same concept as S even as they do not belong in S. Such a situation
will be referred to as a gap. Lifting with gaps should be penalized. Altogether, the
number of conceptual elements introduced to generalize S here is 1 head subject,
that is, the root to which we have assigned .S, and the 4 gaps occurred just because
of the topology of the tree, which imposes this penalty. Another lifting decision is
illustrated in Fig. 3: here the set is lifted just to the root of the left branch of the tree.
We can see that the number of gaps has drastically decreased, to just 1. However,
another oddity emerged: a black box on the right, belonging to S but not covered by
the root of the left branch at which the set S is mapped. This type of error will be
referred to as an offshoot. At this lifting, three new items emerge: one head subject,
one offshoot, and one gap. This is less than the number of items emerged at lifting
the set to the root (one head subject and four gaps, that is, five), which makes it
138 D. Frolov and Z. Taran

Fig. 1 A crisp query set,

shown by black boxes, to be
conceptualized in the
taxonomy

Fig. 2 Generalization of the

query set from Fig. 1 by
mapping it to the root, with
the price of four gaps
emerged at the lift

Fig. 3 Generalization of the

query set from Fig. 1 by
mapping it to the root of the
left branch, with the price of
one gap and one offshoot
emerged at this lift

more preferable. Of course, this conclusion holds only if the relative weight of an
offshoot is less than the total relative weight of three gaps.
We are interested to see whether a fuzzy set S can be generalized by a node t
from higher ranks of the taxonomy, so that S can be thought of as falling within the
framework covered by the node t. The goal of finding an interpretable pigeon-hole
for S within the taxonomy can be formalized as that of finding one or more “head
subjects” t to cover S with the minimum number of all the elements introduced at the
generalization: head subjects, gaps, and offshoots. This goal realizes the principle
of Maximum Parsimony (MP).
Consider a rooted tree T representing a hierarchical taxonomy so that its nodes
are annotated with key phrases signifying various concepts. We denote the set of all
its leaves by I . The relationship between nodes in the hierarchy is conventionally
expressed using genealogical terms: each node .t ∈ T is said to be the parent of
the nodes immediately descending from t in T , its children. We use .χ (t) to denote
the set of children of t. Each interior node .t ∈ T − I is assumed to correspond to a
concept that generalizes the topics corresponding to the leaves .I (t) descending from
A Three-Step Method for Audience Extension over Internet Advertising 139

t, viz. the leaves of the subtree .T (t) rooted at t, which is conventionally referred to
as the leaf cluster of t.
A fuzzy set on I is a mapping u of I to the non-negative real numbers that assigns
a membership value, or support, .u(i) ≥ 0 to each .i ∈ I . We refer to the set .Su ⊂ I ,
where .Su = {i ∈ I : u(i) > 0}, as the base of u. In general, no other assumptions
are made about the function u, other than, for convenience, commonly limiting it
to not exceed unity. Conventional, or crisp, sets correspond to binary membership
functions u such that .u(i) = 1 if .i ∈ Su and .u(i) = 0 otherwise.
Given a fuzzy set u defined on the leaves I of the tree T , one can consider u to be
a (possibly noisy) projection of a higher rank concept, u’s “head subject”, onto the
corresponding leaf cluster. Under this assumption, there should exist a head subject
node h among the interior nodes of the tree T such that its leaf cluster .I (h) more or
less coincides (up to small errors) with .Su . This head subject is the generalization of
u to be found. The two types of possible errors associated with the head subject, if it
does not cover the base precisely, are false positives and false negatives, referred to
in this paper, as gaps and offshoots, respectively. They are illustrated in Figs. 2 and
3. Given a head subject node h, a gap is a node t covered by h but not belonging
to u, so that .u(t) = 0. In contrast, an offshoot is a node t belonging to u so that
.u(t) > 0 but not covered by h. Altogether, the total number of head subjects, gaps,

and offshoots has to be as small as possible. To this end, we introduce a penalty

for each of these elements. Assuming for the sake of simplicity, that the black box
leaves on Fig. 1 have membership function values equal to unity, one can easily see
that the total penalty at the head subject raised to the root (Fig. 2) is equal to .1 + 4γ
where 1 is the penalty for a head subject and .γ , the penalty for a gap, since the lift
on Fig. 2 involves one head subject, the root, and four gaps, the blank box leaves.
Similarly, the penalty for the lift on Fig. 3 to the root of the left-side subtree is equal
to .1 + γ + λ where .λ is the penalty for an offshoot, as there is one copy of each,
head subject, gap, and offshoot, in Fig. 3. Therefore, depending on the relationship
between .γ and .λ either lift on Fig. 2 or lift on Fig. 3 is to be chosen. That will be the
former, if .3γ < λ, or the latter, if otherwise.
Consider a candidate node h in T and its meaning relative to fuzzy set u. An h-
gap is a node g of .T (h), other than h, at which a loss of the meaning has occurred,
that is, g is a maximal u-irrelevant node in the sense that its parent is not u-irrelevant.
Conversely, establishing a node h as a head subject can be considered as a gain of
the meaning of u at the node. The set of all h-gaps will be denoted by .G(h). A node
.t ∈ T is referred to as u-irrelevant if its leaf-cluster .I (t) is disjoint from the base

.Su . Obviously, if a node is u-irrelevant, all of its descendants are also u-irrelevant.

An h-offshoot is a leaf .i ∈ Su which is not covered by h, i.e., .i ∈/ I (h). The set of

all h-offshoots is .Su − I (h). Given a fuzzy topic set u over I, a set of nodes H will
be referred to as a u-cover if: (a) H covers .Su , that is, .Su ⊆ h∈H I (h), and (b) the
nodes in H are unrelated, i.e. .I (h) ∩ I (h ) = ∅ for all .h, h ∈ H such that .h = h .
The interior nodes of H will be referred to as head subjects and the leaf nodes as
offshoots, so the set of offshoots in H is .H ∩ I . The set of gaps in H is the union of
.G(h) over all head subjects .h ∈ H − I .
140 D. Frolov and Z. Taran

We define the penalty function .p(H ) for a u-cover H as:

.
p(H ) = u(h) + + λv(g) + γ u(h). (1)
h∈H −I h∈H −I g∈G(h) h∈H ∩I

The problem we address is to find a u-cover H that globally minimizes the

penalty .p(H ). Such a u-cover is the parsimonious generalization of the set u.
Before applying an algorithm to minimize the total penalty, one needs to execute
a preliminary transformation of the tree by pruning it from all the non-maximal
u-irrelevant nodes, i.e. descendants of gaps. Simultaneously,
the sets of gaps .G(t)
and the internal summary gap importance .V (t) = g∈G(t) v(g) in Eq. (1) can be
computed for each interior node t. We note that the elements of .Su are in the leaf
set of the pruned tree, and the other leaves of the pruned tree are precisely the gaps.
After this, our lifting algorithm ParGenFS applies. For each node t, the algorithm
ParGenFS computes two sets, .H (t) and .L(t), containing those nodes in .T (t) at
which respectively gains and losses of head subjects occur (including offshoots).
The associated penalty .p(t) is computed too.
An assumption of the algorithm is that no gain can happen after a loss. Therefore,
.H (t) and .L(t) are defined assuming that the head subject has not been gained (nor

therefore lost) at any of t’s ancestors. The algorithm ParGenFS recursively computes
.H (t), .L(t) and .p(t) from the corresponding values for the child nodes in .χ (t).

Specifically, for each leaf node that is not in .Su , we set both .L(·) and .H (·) to
be empty and the penalty to be zero. For each leaf node that is in .Su , .L(·) is set to
be empty, whereas .H (·), to contain just the leaf node, and the penalty is defined as
its membership value multiplied by the offshoot penalty weight .γ . To compute .L(t)
and .H (t) for any interior node t, we analyze two possible cases: (a) when the head
subject has been gained at t and (b) when the head subject has not been gained at t.
In case (a), the sets .H (·) and .L(·) at its children are not needed. In this case,
.H (t), .L(t) and .p(t) are defined by:

H (t) = {t}
. L(t) = G(t) (2)
p(t) = u(t) + λV (t).

In case (b), the sets .H (t) and .L(t) are just the unions of those of its children, and
p(t) is the sum of their penalties:
.

H (t) = H (w)
w∈χ (t)

.
L(t) = L(w) (3)
w∈χ (t)

p(t) = p(w).
w∈χ (t)
A Three-Step Method for Audience Extension over Internet Advertising 141

To obtain a parsimonious lift, whichever case gives the smaller value of .p(t) is
chosen.
When both cases give the same values for .p(t), we may choose, say, (a). The
output of the algorithm consists of the values at the root, namely, H – the set of
head subjects and offshoots, L – the set of gaps, and p – the associated penalty.
ParGenFS Algorithm
INPUT: u, T
OUTPUT: .H = H (Root), .L = L(Root), .p = p(Root)
I Base case: for each leaf of the T .
for each leaf .i ∈ I
.L(i) =

if .u(i) > 0
.H (i) = {i}

.p(i) = γ u(i)

else
.H (i) =

.p(i) = 0

II Recursion: for each internal node of the T , down up to the Root.

.pgain = u(t) + λV (t)

.pnogain = w∈χ (t) p(w)
if .pgain ≤ pnogain
.H (t) = {t}

.L(t) = G(t)

.p(t) = pgain

else
.H (t) =
w∈χ (t) H (w)
.L(t) = w∈χ (t) L(w)
.p(t) = pnogain

III Return the sets .H = H (Root), .L = L(Root), .p = p(Root).

It was mathematically proven that the algorithm ParGenFS leads to an optimal
lifting indeed [1].

4 A 3-Step Method, Developing and Lifting of User Segments

Profiles

Intuitively, the method consists of three steps: (1) computing membership values
for the interest segments for a user by a classifier; (2) performing generalization of
those sets and obtaining high-ranked segments, which is a core part of the method;
(3) obtaining a set of advertising campaigns for a user.
There are three inputs to our method, Developing and Lifting of User Segment
Profile (DLUSP). These are: (i) a large set of internet users from which the audience
142 D. Frolov and Z. Taran

Fig. 4 IAB Contents taxonomy fragment

is recruited for an advertisement; (ii) an industrial taxonomy of goods and services;

(iii) a set of taxonomy segments relevant to an advertisement under consideration.
The user set in (i) is maintained by a special service storing information of
millions individual users visiting popular sites such as amazon.com or ozon.ru
for making purchases, getting services, etc., for native advertising. Such is Data
Management Platform (DMP) developed in a small company, start-up Natimatica,
Ltd. (see https://natimatica.com).
The industrial taxonomy in (ii) is exemplified by the IAB Content taxonomy [3],
which is a 4-layer rooted tree of taxonomy topics. We focus on its leaf segments. A
fragment from Business and Finance branch of the IAB taxonomy can be seen in
Fig. 4.
The set of taxonomy segments in (iii) comes from a chat between a specially
assigned company employee and the advertiser of the contents of their advertise-
ment.
Our method DLUSP includes two technically loaded components: Development
of user segment profile (DUSP) and Lifting of user segment