Data Analysis and Optimization
Boris Goldengorin
Sergei Kuznetsov Editors
Volume 202
Series Editors
Panos M. Pardalos , University of Florida, Gainesville, FL, USA
My T. Thai , CSE Building, University of Florida, Gainesville, FL, USA
Advisory Editors
Roman V. Belavkin, Faculty of Science and Technology, Middlesex University,
London, UK
John R. Birge, University of Chicago, Chicago, IL, USA
Sergiy Butenko, Texas A&M University, College Station, TX, USA
Vipin Kumar, Dept Comp Sci & Engg, University of Minnesota, Minneapolis, MN,
USA
Anna Nagurney, Isenberg School of Management, University of Massachusetts
Amherst, Amherst, MA, USA
Jun Pei, School of Management, Hefei University of Technology, Hefei, Anhui,
China
Oleg Prokopyev, Department of Industrial Engineering, University of Pittsburgh,
Pittsburgh, PA, USA
Steffen Rebennack, Karlsruhe Institute of Technology, Karlsruhe, Baden-
Württemberg, Germany
Mauricio Resende, Amazon (United States), Seattle, WA, USA
Tamás Terlaky, Lehigh University, Bethlehem, PA, USA
Van Vu, Department of Mathematics, Yale University, New Haven, CT, USA
Michael N. Vrahatis, Mathematics Department, University of Patras, Patras, Greece
Guoliang Xue, Ira A. Fulton School of Engineering, Arizona State University,
Tempe, AZ, USA
Yinyu Ye, Stanford University, Stanford, CA, USA
Honorary Editor
Ding-Zhu Du, University of Texas at Dallas, Richardson, TX, USA
Aims and Scope
Optimization has continued to expand in all directions at an astonishing rate. New
algorithmic and theoretical techniques are continually being developed, and their diffusion
into other disciplines is proceeding at a rapid pace, with a spotlight on machine
learning, artificial intelligence, and quantum computing. Our knowledge of all
aspects of the field has grown even more profound. At the same time, one of the
most striking trends in optimization is the constantly increasing emphasis on the
interdisciplinary nature of the field. Optimization has been a basic tool in applied
mathematics, engineering, medicine, economics, computer science, operations
research, and other sciences.
The series Springer Optimization and Its Applications (SOIA) aims to publish
state-of-the-art expository works (monographs, contributed volumes, textbooks,
handbooks) that focus on theory, methods, and applications of optimization. Topics
covered include, but are not limited to, nonlinear optimization, combinatorial opti-
mization, continuous optimization, stochastic optimization, Bayesian optimization,
optimal control, discrete optimization, multi-objective optimization, and more. New
additions to the series portfolio include works at the intersection of optimization and
machine learning, artificial intelligence, and quantum computing.
Volumes from this series are indexed by Web of Science, zbMATH, Mathematical
Reviews, and SCOPUS.
Boris Goldengorin • Sergei Kuznetsov
Editors
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book presents the state of the art in the emerging field of data science, which
includes models for layered security, protection of large gathering sites, cancer
diagnostics, self-driving cars, and other applications where wrong decisions can have
catastrophic consequences. The manipulability of aggregation procedures in the case of
large numbers of voters is analyzed from a theoretical point of view and is supported
by computational experiments involving numbers of voters at least an order of
magnitude larger. Many tree-type structures are considered: from phylogenetic
trees representing the main patterns of vertical descent, through consensus trees
and supertrees widely used in evolutionary studies to combine the phylogenetic
information contained in individual gene trees. The statistical part of this book
studies the impact of data mining and modeling on the predictability assessment of time
series. New notions of mean values based on ideas of multicriteria optimization are
compared to their conventional definitions, leading to fresh algorithmic approaches.
To summarize, the book presents methods for automated analysis of patterns
and models for data of different nature, with applications ranging from scientific
discovery to business intelligence and analytics. The style of the chapters
makes this book suitable for senior undergraduate and graduate data mining
courses, providing a broad yet in-depth review that integrates novel concepts from
machine learning and statistics. The main parts of the book include exploratory data
analysis, pattern mining, clustering, and classification, supported by real-life case
studies.
Students and professionals specializing in computer and management science,
data mining for high-dimensional data, complex graphs, and networks will benefit
from many cutting-edge ideas and practically motivated case studies.
From a technical point of view, each paper received at least two referee
reports, and almost all papers were presented at the International Conference “Data
Analysis, Optimization and Their Applications,” held on the occasion of Boris Mirkin’s
80th birthday on January 30–31, 2023, at the Moscow Institute of Physics and
Technology, Dolgoprudny, Moscow Region.
To preserve the atmosphere of that conference, we have included
the Book of Abstracts and the conference schedule. We would like to thank all speakers and
contributors to the conference and volume dedicated to Boris Mirkin’s 80th birthday.
Without the crucial support on organizational issues, the conference and the book
could not have become a reality. Here is the list of our supporters and reviewers:
Fuad Aleskerov, NRU HSE, Russia
Daniel Berend, Ben Gurion University, Israel
Jean Diatta, University of la Réunion, France
Trevor Fenner, University of London, UK
Sergei Frenkel, FRC Computer Science and Control, Russia
Alexander Gasnikov, MIPT, Russia
Boris Goldengorin, University of Florida, USA
Eugene Koonin, NIH/NLM/NCBI, USA
Sergei Kuznetsov, NRU HSE, Russia
Boris Kovalerchuk, Central Washington University, USA
Leonid Litinskii, ISA RAS, Russia
Vladimir Makarenkov, University of Quebec in Montreal, Canada
Magomed Malsagov, ISA RAS, Russia
Boris Mirkin, NRU HSE, Russia and UL London, UK
Susana Nascimento, New University of Lisbon, Portugal
Alexandre Orlov, Bauman Moscow State Technical University, Russia
Panos Pardalos, University of Florida, USA
Natalia Pavlova, MIPT, Russia
Andrei Raigorodskii, MIPT, Russia
Alexey Remizov, MIPT, Russia
Fred Roberts, Rutgers University, USA
Yuri Sidel’nikov, MAI, Russia
Konstantin Vorontsov, Lomonosov MSU, Russia
Organization Committee
Invited Speakers
Schedule
Time Speaker
11:00–11:25 Fuad Aleskerov
11:30–11:55 Alexander Lepskiy
12:00–12:25 Irina Maximova
Coffee break
13:00–13:25 Yuliya A Veselova
13:30–13:55 Tendai Chikake
14:00–14:25 Alexey Samosyuk
Lunch
15:00–15:25 Alexander Rubchinsky
15:30–15:55 Susana Nascimento
16:00–16:25 Dmitry Frolov
Coffee break
17:00–17:25 Aleksandr Beznosikov
17:30–17:55 E Dov Neimand
18:00–18:25 Boris Kovalerchuk
18:30–18:55 Alexander Karpov
Authors A.S. Markov, E.Yu. Kotlyarov, N.P. Anosova, V.A. Popov, Yakov
Karandashev, D.E. Apushkinskaya
Abstract This paper concerns the problem of detecting anomalies in X-ray images
taken by full-body scanners (FBS). Our previous work describes the sequence of
image preprocessing methods used to convert the original images, which are
produced with FBS, into images with visually distinguishable anomalies. In this
paper, we focus on the development of the proposed methods, including the
addition of preprocessing methods and the creation of a generator that can
produce synthetic anomalies. Examples of processed images are given. The
results of using a neural network for anomaly detection are shown.
Affiliation Peoples’ Friendship University of Russia (RUDN University), Moscow, Russian
Federation
Contact [email protected],[email protected],[email protected],
[email protected],[email protected],[email protected]
Keywords Full-body scanner, X-ray image, anomaly detection, image histogram
equalization, generator, neural network, U-Net
Authors Susana Nascimento, Alexandre Martins, Paulo Relvas, Joaquim F. Luís,
Boris Mirkin
Abstract This work proposes a three-stage spatiotemporal clustering approach for the
automatic recognition and analysis of coastal upwelling from Sea Surface
Temperature (SST) grids derived from satellite images.
The algorithm, core-shell clustering, models the upwelling as an evolving
cluster whose core points are constant during a certain time window while the
shell points move through an in-and-out binary sequence. The least squares
minimization of the clustering criterion allows key parameters to be derived in an
automated way.
The algorithm is initialized with an extension of Seeded Region Growing
offering self-tuning thresholding, the STSEC algorithm, which is able to precisely
delineate the upwelling regions in each SST instant grid. Further, applying
STSEC to the SST grids as temporal data puts the task of finding relatively
stable “time windows”, here called “time ranges”, for obtaining the core clusters
on an automated footing.
The approach was successfully applied to SST image data for sixteen successive
years of coastal upwelling of the Canary Current Upwelling System, covering
two distinct regions: the Portuguese coast and the Morocco coast. The extracted
time series of upwelling features presented consistent regularities among the
upwelling seasons.
Affiliation Dep. of Computer Science and NOVA Laboratory for Computer Science and
Informatics (NOVA-LINCS) Faculdade de Ciências e Tecnologia, Universidade
Nova de Lisboa, Lisbon, Portugal, Centre of Marine Sciences (CCMAR),
Universidade do Algarve, Faro, Portugal, Instituto Dom Luís, Universidade de
Lisboa, Lisboa, Portugal, and Universidade do Algarve, Faro, Portugal
Department of Data Analysis and AI, Higher School of Economics, Moscow,
Russia, and Department of Computer Science, Birkbeck University of London,
London, UK
Contact [email protected]
Keywords Spatiotemporal clustering, Time series segmentation, Data recovery clustering,
SST images, Coastal upwelling
Boris Goldengorin is the author and inventor of data correcting and tolerance-
based algorithms applied to many problems in operations research, supply chain
management, quantitative logistics, industrial engineering, data and stock market
analysis. Boris is the author of more than 100 articles published in leading
international journals, including the Journal of Algebraic Combinatorics, Dis-
crete Optimization, Journal of Combinatorial Optimization, Journal of Global
Optimization, Operations Research, Management Science, European Journal of
Operational Research, Journal of Operational Research Society, Mathematical
and Computer Modelling, Computers and Operations Research, Computers and
Industrial Engineering, Expert Systems with Applications, Journal of Heuristics,
Optimization Methods and Software, Computational Management Science, and
many others. Dr. Goldengorin has published four monographs and three textbooks, and
has edited five books on mathematical programming, game theory, combinatorial
optimization, network analysis algorithms, graph theory, and big data analysis.
He is an associate editor of the Journal of Global Optimization, the Journal of
Combinatorial Optimization, and SN Operations Research Forum, and a member of the
Editorial Board of the Journal of Computational and Applied Mathematics of the Taras
Shevchenko National University of Kyiv, Ukraine.
Sergei Kuznetsov graduated from the Faculty of Applied Mathematics and Control
of the Moscow Institute of Physics and Technology in 1985. He obtained his
Doctor of Science degree in Theoretical Computer Science at the Computing Center
of the Russian Academy of Sciences. Since 2006, he has been the Head of the Department
of Data Analysis and Artificial Intelligence, Head of the International Laboratory
for Intelligent Systems and Structural Analysis, and Academic Supervisor of the
Data Science master's program at the National Research University Higher School of
Economics (Moscow). His research interests are in algorithms for data mining,
knowledge discovery, and formal concept analysis.
Optimal Layered Defense For Site Protection
1 Introduction
2 Mathematical Model
We concentrate in this paper on the problem with two layers of defense, where
each security layer has a number of sensors placed on possible paths of incoming
illegal flow of vehicles and/or patrons. Inner layers are composed of sensors that are
used to detect units that have managed to infiltrate outer layers undetected. Every
interior sensor is connected to one or more sensors in the immediately preceding
outer layer in the sense that it is responsible for backing up those sensors, i.e., the
goal of the interior sensor is to discover traffic which has remained undetected by
those outer sensors. In order for an illegal unit to penetrate the system, it would
need to remain unnoticed at all layers of inspection. We denote the set of sensors
in the internal layer of defense with I , and the set of sensors in the external layer
with J . We assume that sensors in the set I share a limited total resource budget X
and sensors in the set J share a limited total budget Y . More subtle models allow
one to make decisions about how much budget to allocate between the inside and
outside layers. Our objective is to develop optimization methods to determine the
optimal allocation of resources to security sensors in such a manner that the expected
detection rate of incoming threats is maximized. For modeling purposes we employ
the following assumptions:
• There exists only one type of violation that we are protecting against.
• The expected number of contraband units on each incoming path is a known
parameter. For the outermost perimeter sensor j , we denote the incoming
contraband flow with .Fj .
• For each sensor $i \in I$, located at the inside security layer, we know the function
$D_i^x(x)$, which specifies the detection rate at the sensor for contraband items if
the total amount of resources made available to the sensor is $x$; similarly,
for each $j \in J$ we know the detection function $D_j^y(y)$. In this paper we assume
that the detection functions are specified as concave increasing piecewise linear
functions (a small illustrative sketch of such a function is given after this list).
Thus, we do not require the detection functions to be differentiable
everywhere, which is an important property of our method. We also assume that
the resources are normalized to take values between 0 and 1.
• All sensors at a given layer share a limited common resource. For example,
an outside perimeter could be supported by a fixed number of infrared motion
detectors or license plate readers, while an inside perimeter could consist of
walkthrough magnetometer tests or security guards conducting wanding of
patrons.
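To make the shape of these detection functions concrete, the following minimal sketch (not the authors' implementation) represents a concave increasing piecewise linear detection function as the pointwise minimum of affine pieces; the specific pieces are borrowed from Example 8.1 later in the chapter, and the function names are placeholders.

```python
# A minimal sketch (not the authors' implementation): a concave increasing
# piecewise linear detection function D(r) = min_k (a_k * r + b_k),
# as assumed in the model above.
def make_detection_fn(pieces):
    """pieces: list of (slope, intercept) pairs; returns r -> min_k(a_k*r + b_k)."""
    def detection(resource):
        resource = max(0.0, resource)  # resources are nonnegative
        return min(a * resource + b for a, b in pieces)
    return detection

# The inside- and outside-sensor functions used in Example 8.1:
D_inside = make_detection_fn([(0.2, 0.0), (0.1, 0.4)])   # min{0.2x, 0.4 + 0.1x}
D_outside = make_detection_fn([(0.3, 0.0), (0.1, 0.3)])  # min{0.3y, 0.3 + 0.1y}

if __name__ == "__main__":
    for r in (0.0, 2.0, 6.0):
        print(r, D_inside(r), D_outside(r))
```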
Our goal is to allocate the total outside resources among individual sensors and
allocate the total inside resources among individual sensors in order to maximize
the detected illegal flow. Thus we arrive at the following mathematical formulation:
$$
\begin{aligned}
\max_{x,y}\ & \sum_{i \in I} \left\{ \left( \sum_{j \in N(i)} F_j \cdot D_j^y(y_j) \right) + D_i^x(x_i) \left( \sum_{j \in N(i)} F_j \left(1 - D_j^y(y_j)\right) \right) \right\} \\
\text{s.t.}\ & \sum_{i \in I} x_i \le X \\
& \sum_{j \in J} y_j \le Y \\
& x_i \ge 0, \quad \forall i \in I \\
& y_j \ge 0, \quad \forall j \in J
\end{aligned}
\tag{1}
$$
where .N (i) denotes the set of outside sensors adjacent to inside sensor i. Here, the
first sum over the outside neighbors j of i gives the flow that is captured at j and
the second sum gives the flow that is not captured at j but is captured at i.
Now, let us examine the given objective function. Clearly, it contains mixed
nonlinear product terms of detection probabilities. Moreover, since there are no pure
quadratic terms, we know that in general the objective function is neither convex,
nor concave. We illustrate this in Example 2.1.
Example 2.1 (Indefinite Objective Function) Suppose we have a single exterior
sensor j preceding a single interior sensor i, as shown in Fig. 2. In this case, we
would need to solve the following problem.
$$
\begin{aligned}
\max_{x_i, y_j}\ & F_j \cdot D_j^y(y_j) + D_i^x(x_i) \cdot \left( F_j \cdot \left(1 - D_j^y(y_j)\right) \right) \\
\text{s.t.}\ & 0 \le x_i \le X \\
& 0 \le y_j \le Y
\end{aligned}
\tag{2}
$$

Let us for a moment consider what would happen if $D_i^x(x) = x$ and $D_j^y(y) = y$, as
shown in Fig. 3. In that case, $D_i^x$ and $D_j^y$ are differentiable everywhere. Thus, if we
denote the objective function in problem (2) as

$$
f^{i,j}(x_i, y_j) = F_j \cdot D_j^y(y_j) + D_i^x(x_i) \cdot \left( F_j \cdot \left(1 - D_j^y(y_j)\right) \right)
$$
then we know
Fig. 2 A model of layered defense showing a target with a single exterior sensor preceding a
single interior sensor
Fig. 3 Linear detection rates at both the exterior and interior sensors of Fig. 2
$$
\nabla f^{i,j}(x_i, y_j) =
\begin{bmatrix} \dfrac{\partial f^{i,j}(x_i, y_j)}{\partial x_i} \\[2mm] \dfrac{\partial f^{i,j}(x_i, y_j)}{\partial y_j} \end{bmatrix}
=
\begin{bmatrix} F_j \left(1 - D_j^y(y_j)\right) \\[1mm] F_j \left(1 - D_i^x(x_i)\right) \end{bmatrix}
=
\begin{bmatrix} F_j (1 - y_j) \\[1mm] F_j (1 - x_i) \end{bmatrix}
\ge
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
\tag{3}
$$
Moreover, the Hessian of $f^{i,j}$ is the constant matrix $\begin{bmatrix} 0 & -F_j \\ -F_j & 0 \end{bmatrix}$, whose two eigenvalues are $\lambda_1 = F_j$ and $\lambda_2 = -F_j$. Hence we know that the
Hessian matrix associated with the quadratic terms is indefinite.
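The indefiniteness can also be checked numerically. The short sketch below (an added illustration, not part of the original text) builds this constant Hessian for $F_j = 1$ and confirms that one eigenvalue is positive and one is negative.

```python
# A minimal numerical check (illustration only): for linear detection rates
# D_i^x(x_i) = x_i and D_j^y(y_j) = y_j, the objective of problem (2) is
#   f(x_i, y_j) = F_j * y_j + x_i * F_j * (1 - y_j),
# whose Hessian is the constant matrix [[0, -F_j], [-F_j, 0]].
import numpy as np

F_j = 1.0
hessian = np.array([[0.0, -F_j],
                    [-F_j, 0.0]])
print(np.linalg.eigvalsh(hessian))  # [-1.  1.]: one negative, one positive -> indefinite
```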
The indefiniteness of the Hessian presents a major obstacle to solving the
problem with standard solvers for quadratic programming. In our study, we tried
solving numerous instances using different methods implemented in the MATLAB
optimization toolbox. While in some cases we were able to produce consistent
output, none of the examined methods were able to overcome the indefiniteness
of the Hessian matrix for all possible values of the input parameters. This created
the need for the development of an alternative solution method for the problem.
Fig. 4 A plot of the objective function of (2) for the case of $D_i^x(x_i) = x_i$, $D_j^y(y_j) = y_j$ and
$F_j = 1$. In this case three of the corners of the feasible region are optimal solutions
However, it is sufficient to discretize the parameter space for the interior sensors.
Then, for each fixed set of values, we can find the optimal configuration of
the exterior perimeter by solving a linear programming problem. If we take this
approach, then the two above-mentioned instances require, respectively, the solution
of $10^8$ and $10^{30}$ small linear programming problems. While this is a significant
improvement, we are still subject to the curse of dimensionality as the number
of sensors in the interior perimeter increases. Fortunately, we can overcome this
challenge.
To illustrate the idea behind our method, we consider the following basic example.
Suppose we would like to solve problem (2) for the case when $D_i^x$ and $D_j^y$ are
general piecewise linear functions, as illustrated in Fig. 5.
In that case, the objective function $f^{i,j}(x_i, y_j)$ of problem (2) is still continuous.
Further, $f^{i,j}(x_i, y_j)$ is differentiable everywhere except at points corresponding to
corner points of the detection functions $D_i^x$ and $D_j^y$. Moreover, at points where
$f^{i,j}(x_i, y_j)$ is differentiable, its gradient has the form

$$
\nabla f^{i,j}(x_i, y_j) =
\begin{bmatrix} \dfrac{\partial f(x_i, y_j)}{\partial x_i} \\[2mm] \dfrac{\partial f(x_i, y_j)}{\partial y_j} \end{bmatrix}
=
\begin{bmatrix} c_i F_j \left(1 - D_j^y(y_j)\right) \\[1mm] c_j F_j \left(1 - D_i^x(x_i)\right) \end{bmatrix}
\ge
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
\tag{5}
$$
Fig. 5 Piecewise linear detection rates at both the exterior and interior sensors of Fig. 2
for some constants $c_i, c_j \ge 0$. Thus, the optimal solution is again obtained by setting
$x_i = X$ and $y_j = Y$.
We can solve an instance of problem (6) in $O(1)$ time by setting $x_i = X_i$ and
$y_j = Y_j$. Thus, we can compute $T^{i,j}(X_i, Y_j)$ for each $(X_i, Y_j) \in \mathcal{X} \times \mathcal{Y}$ in
$O(|\mathcal{X}||\mathcal{Y}|)$ time. Hence, we can plot the objective function in (2) with arbitrary precision,
which would ultimately allow us to solve problem (1) with arbitrary precision (see
Fig. 6).
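The table construction just described can be sketched as follows (an illustrative sketch, not the authors' implementation; the grids and detection functions are placeholders). It uses the observation above that, for a single inner–outer sensor pair with increasing detection functions, spending the entire available budgets is optimal, so each table entry takes O(1) time to evaluate.

```python
# A minimal sketch (illustration only) of Step 2: tabulating T^{i,j}(X_i, Y_j)
# over discretized budget grids for one inner sensor i backed by one outer
# sensor j.  Each entry is the flow captured when the full budgets are spent.
import numpy as np

def build_pair_table(D_in, D_out, F_j, x_grid, y_grid):
    """T[a, b] = F_j*D_out(Y_b) + D_in(X_a) * F_j * (1 - D_out(Y_b))."""
    table = np.empty((len(x_grid), len(y_grid)))
    for a, X in enumerate(x_grid):
        for b, Y in enumerate(y_grid):
            captured_outside = F_j * D_out(Y)
            captured_inside = D_in(X) * F_j * (1.0 - D_out(Y))
            table[a, b] = captured_outside + captured_inside
    return table

# Placeholder usage with the detection functions of Example 8.1 and unit flow:
x_grid = np.linspace(0.0, 10.0, 101)
y_grid = np.linspace(0.0, 10.0, 101)
D_in = lambda x: min(0.2 * x, 0.4 + 0.1 * x)
D_out = lambda y: min(0.3 * y, 0.3 + 0.1 * y)
T_ij = build_pair_table(D_in, D_out, F_j=1.0, x_grid=x_grid, y_grid=y_grid)
```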
Step 3
Now, suppose that instead of a single outside sensor j , we consider two outside
sensors .j1 and .j2 that are both backed up by inner sensor i (see Fig. 7). We use
.{j1 , j2 } to denote quantities that refer to the combined system with two outside
sensors .j1 and .j2 . For example, .T i,{j1 ,j2 } denotes a table of optimal detection values
for the combined system of inside sensor i and two outside sensors .j1 and .j2 .
Further, we also use .Xi,{j1 ,j2 } and .Y i,{j1 ,j2 } to denote the amount of inside and
outside resources budgeted to the sensor system of inner sensor i, and outer sensors
.j1 and .j2 . Please recall that in Step 2, we computed the two tables of optimal
values .T i,j1 and .T i,j2 (considering first the problem with only sensors .i, j1 and
then the problem with only sensors $i, j_2$ (see Fig. 8)). Since they both share the
same inner sensor, all we need to do in order to find the optimal detection value
$T^{i,\{j_1,j_2\}}\left(X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}}\right)$ for the combined system is to determine the optimal
way to allocate $Y^{i,\{j_1,j_2\}}$ between sensors $j_1$ and $j_2$. Thus we have the problem of
optimizing the following formulation:
Fig. 6 A plot of the objective function of (2) for piecewise linear functions $D_i^x(x_i)$ and $D_j^y(y_j)$
$$
T^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) =
\left\{
\begin{aligned}
\max_{y^{i,j_1},\, y^{i,j_2}}\ & T^{i,j_1}\left( X^{i,\{j_1,j_2\}}, y^{i,j_1} \right) + T^{i,j_2}\left( X^{i,\{j_1,j_2\}}, y^{i,j_2} \right) \\
\text{s.t.}\ & y^{i,j_1} + y^{i,j_2} \le Y^{i,\{j_1,j_2\}} \\
& y^{i,j_1} \in \mathcal{Y} \\
& y^{i,j_2} \in \mathcal{Y}
\end{aligned}
\right.
\tag{7}
$$

Notice that even though we consider all different values of $y^{i,j_1}, y^{i,j_2} \in \mathcal{Y}$,
problem (7) can be solved in time linear in the cardinality of $\mathcal{Y}$. This is accomplished
by using two index variables initialized at the two endpoints of the outside
resource partition $\mathcal{Y}$.
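A direct way to carry out the merge in (7) is sketched below (an illustrative sketch, assuming uniform budget grids starting at 0, so that budgets can be identified with grid indices). For clarity it enumerates every split of the shared outer budget; the two-index-variable technique mentioned above replaces this inner loop with a single linear pass over the outer grid.

```python
# A minimal sketch (illustration only) of the merge in (7): the tables T^{i,j1}
# and T^{i,j2} share the inner sensor i (and hence the inner budget), and the
# outer budget is split optimally between j1 and j2.  Budgets are identified
# with indices into uniform grids starting at 0, so splitting index b means
# choosing b1 + b2 <= b.
import numpy as np

def merge_outer_tables(T1, T2):
    """T1, T2: arrays of shape (n_x, n_y).  Returns T with
    T[a, b] = max over b1 + b2 <= b of T1[a, b1] + T2[a, b2]."""
    n_x, n_y = T1.shape
    merged = np.full((n_x, n_y), -np.inf)
    for a in range(n_x):
        for b in range(n_y):
            for b1 in range(b + 1):
                merged[a, b] = max(merged[a, b], T1[a, b1] + T2[a, b - b1])
    return merged
```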
If we have three outside sensors .j1 , j2 , j3 corresponding to inside sensor i, then
we can find their solution matrix .T i,{j1 ,j2 ,j3 } as follows. Once we have computed
Fig. 8 Finding separate solutions for inner sensor i with each outside sensor .j1 and .j2
the matrix .T i,{j1 ,j2 } we use it as an input to Eq. (7), together with the matrix .T i,j3
to generate the solution matrix .T i,{j1 ,j2 ,j3 } . By recursion, we can solve a problem
instance that involves one inner sensor i and any number of outside sensors. We
denote with .T i the solution table of optimal values corresponding to inner sensor i
together with all of its adjacent outside sensors .j ∈ N (i).
Step 4
Suppose we are given two matrices, $T^{i_1}$ and $T^{i_2}$, that correspond respectively to two
inner sensors $i_1$ and $i_2$ with their adjacent outside sensors. In order to determine the
optimal detection value $T^{\{i_1,i_2\}}\left(X^{\{i_1,i_2\}}, Y^{\{i_1,i_2\}}\right)$ for the combined system, we need
to find the optimal way to allocate $X^{\{i_1,i_2\}}$ between $T^{i_1}$ and $T^{i_2}$. Thus we consider
the problem of optimizing the following:
$$
T^{\{i_1,i_2\}}\left( X^{\{i_1,i_2\}}, Y^{\{i_1,i_2\}} \right) =
\left\{
\begin{aligned}
\max_{x^{i_1}, x^{i_2}, y^{i_1}, y^{i_2}}\ & T^{i_1}\left( x^{i_1}, y^{i_1} \right) + T^{i_2}\left( x^{i_2}, y^{i_2} \right) \\
\text{s.t.}\ & x^{i_1} + x^{i_2} \le X^{\{i_1,i_2\}} \\
& y^{i_1} + y^{i_2} \le Y^{\{i_1,i_2\}} \\
& x^{i_1}, x^{i_2} \in \mathcal{X} \\
& y^{i_1}, y^{i_2} \in \mathcal{Y}
\end{aligned}
\right.
\tag{8}
$$

We point out that problem (8) can be solved in $O(|\mathcal{X}||\mathcal{Y}|)$ time. Finally, once we
have computed $T^{\{i_1,i_2\}}$, we can proceed by recursion to solve problems involving an
arbitrary number of interior and exterior sensors.
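In the same illustrative style (again assuming index-based uniform budget grids), a single instance of problem (8) can be evaluated by enumerating the splits of both budgets, which matches the O(|X||Y|) work stated above; building the full table for the pair of inner sensors then amounts to repeating this for every budget pair, and the recursion extends it to more inner sensors.

```python
# A minimal sketch (illustration only) of one instance of problem (8): splitting
# the inside budget (index a) and the outside budget (index b) between the two
# subsystems described by the tables T1 = T^{i1} and T2 = T^{i2}.
def solve_inner_merge(T1, T2, a, b):
    """max over a1 + a2 <= a and b1 + b2 <= b of T1[a1][b1] + T2[a2][b2]."""
    best = float("-inf")
    for a1 in range(a + 1):
        for b1 in range(b + 1):
            best = max(best, T1[a1][b1] + T2[a - a1][b - b1])
    return best
```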
5 Running Time
6 Convergence
Consider the objective function $f^{i,j}(\cdot, \cdot)$ of the system consisting of inner
sensor $i$ and its external neighbor $j$. Since $f^{i,j}$ is continuous, as well as quadratic
everywhere except for a set of measure zero, we know that $f^{i,j}$ is Lipschitz
continuous, and we denote its Lipschitz constant with $L^{i,j}$. If we choose $L \in \mathbb{R}$
such that

$$
L = \max_{i \in I,\ j \in N(i)} L^{i,j}
$$
up the inside resources $X^{i,\{j_1,j_2\}}$ between $t^{i,j_1}$ and $t^{i,j_2}$, while $y^{i,j_1}$ and $y^{i,j_2}$ denote
the optimal way to split up the outside resources $Y^{i,\{j_1,j_2\}}$ between $t^{i,j_1}$ and $t^{i,j_2}$.
On the other hand, we would use $X^{i,j_1}$ and $X^{i,j_2}$ to denote the optimal way to
split up the inside resources $X^{i,\{j_1,j_2\}}$ between $T^{i,j_1}$ and $T^{i,j_2}$, while $Y^{i,j_1}$ and $Y^{i,j_2}$
denote the optimal way to split up the outside resources $Y^{i,\{j_1,j_2\}}$ between $T^{i,j_1}$ and
$T^{i,j_2}$. Before we proceed, we need to introduce the following notation. We use $\lfloor x \rfloor$
to denote the point in $\mathcal{X}$ that is closest to $x$ from below, and we use $\lfloor y \rfloor$ to denote
the point in $\mathcal{Y}$ that is closest to $y$ from below. Similarly, we use $\lceil x \rceil$ to denote the
point in $\mathcal{X}$ that is closest to $x$ from above, and we use $\lceil y \rceil$ to denote the point in $\mathcal{Y}$
that is closest to $y$ from above. Then the discrete approximation error can be written
as

$$
\begin{aligned}
\xi^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right)
&= t^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) - T^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) \\
&= t^{i,j_1}\left( x^{i,j_1}, y^{i,j_1} \right) + t^{i,j_2}\left( x^{i,j_2}, y^{i,j_2} \right)
 - T^{i,j_1}\left( X^{i,j_1}, Y^{i,j_1} \right) - T^{i,j_2}\left( X^{i,j_2}, Y^{i,j_2} \right) \\
&\le t^{i,j_1}\left( x^{i,j_1}, y^{i,j_1} \right) + t^{i,j_2}\left( x^{i,j_2}, y^{i,j_2} \right)
 - T^{i,j_1}\left( \lfloor x^{i,j_1} \rfloor, \lfloor y^{i,j_1} \rfloor \right) - T^{i,j_2}\left( \lfloor x^{i,j_2} \rfloor, \lfloor y^{i,j_2} \rfloor \right) \\
&= t^{i,j_1}\left( x^{i,j_1}, y^{i,j_1} \right) + t^{i,j_2}\left( x^{i,j_2}, y^{i,j_2} \right)
 - t^{i,j_1}\left( \lfloor x^{i,j_1} \rfloor, \lfloor y^{i,j_1} \rfloor \right) - t^{i,j_2}\left( \lfloor x^{i,j_2} \rfloor, \lfloor y^{i,j_2} \rfloor \right) \\
&= \left( t^{i,j_1}\left( x^{i,j_1}, y^{i,j_1} \right) - t^{i,j_1}\left( \lfloor x^{i,j_1} \rfloor, \lfloor y^{i,j_1} \rfloor \right) \right)
 + \left( t^{i,j_2}\left( x^{i,j_2}, y^{i,j_2} \right) - t^{i,j_2}\left( \lfloor x^{i,j_2} \rfloor, \lfloor y^{i,j_2} \rfloor \right) \right)
\end{aligned}
\tag{9}
$$
Since we have $\left| t^{i,j_k}\left( x^{i,j_k}, y^{i,j_k} \right) - t^{i,j_k}\left( \lfloor x^{i,j_k} \rfloor, \lfloor y^{i,j_k} \rfloor \right) \right| \le \sqrt{2}\, L$ for both $k = 1, 2$, the bound

$$
\xi^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) \le 2\sqrt{2}\, L
\tag{10}
$$

follows. Now, we can also bound the error in the case of three outside sensors:
$$
\begin{aligned}
\xi^{i,\{j_1,j_2,j_3\}}\left( X^{i,\{j_1,j_2,j_3\}}, Y^{i,\{j_1,j_2,j_3\}} \right)
&= t^{i,\{j_1,j_2,j_3\}}\left( X^{i,\{j_1,j_2,j_3\}}, Y^{i,\{j_1,j_2,j_3\}} \right)
 - T^{i,\{j_1,j_2,j_3\}}\left( X^{i,\{j_1,j_2,j_3\}}, Y^{i,\{j_1,j_2,j_3\}} \right) \\
&= t^{i,\{j_1,j_2\}}\left( x^{i,\{j_1,j_2\}}, y^{i,\{j_1,j_2\}} \right) + t^{i,j_3}\left( x^{i,j_3}, y^{i,j_3} \right)
 - T^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) - T^{i,j_3}\left( X^{i,j_3}, Y^{i,j_3} \right) \\
&\le t^{i,\{j_1,j_2\}}\left( x^{i,\{j_1,j_2\}}, y^{i,\{j_1,j_2\}} \right) + t^{i,j_3}\left( x^{i,j_3}, y^{i,j_3} \right)
 - T^{i,\{j_1,j_2\}}\left( \lfloor x^{i,\{j_1,j_2\}} \rfloor, \lfloor y^{i,\{j_1,j_2\}} \rfloor \right)
 - T^{i,j_3}\left( \lfloor x^{i,j_3} \rfloor, \lfloor y^{i,j_3} \rfloor \right) \\
&\le t^{i,\{j_1,j_2\}}\left( x^{i,\{j_1,j_2\}}, y^{i,\{j_1,j_2\}} \right) + t^{i,j_3}\left( x^{i,j_3}, y^{i,j_3} \right)
 - t^{i,\{j_1,j_2\}}\left( \lfloor x^{i,\{j_1,j_2\}} \rfloor, \lfloor y^{i,\{j_1,j_2\}} \rfloor \right)
 - t^{i,j_3}\left( \lfloor x^{i,j_3} \rfloor, \lfloor y^{i,j_3} \rfloor \right) + 2\sqrt{2}\, L \\
&= \left( t^{i,j_1}\left( x^{i,j_1}, y^{i,j_1} \right) - t^{i,j_1}\left( \lfloor x^{i,j_1} \rfloor, \lfloor y^{i,j_1} \rfloor \right) \right)
 + \left( t^{i,j_2}\left( x^{i,j_2}, y^{i,j_2} \right) - t^{i,j_2}\left( \lfloor x^{i,j_2} \rfloor, \lfloor y^{i,j_2} \rfloor \right) \right) + 2\sqrt{2}\, L \\
&\le \sqrt{2}\, L + \sqrt{2}\, L + 2\sqrt{2}\, L \\
&= 4\sqrt{2}\, L
\end{aligned}
\tag{11}
$$

Proceeding in the same way, we obtain
$$
\xi^{i} \le 2\sqrt{2}\,(|J| - 1)\, L, \quad \forall i \in I,
$$
since an inside sensor can have at most $|J|$ adjacent outside sensors.
Now, suppose that $i_1, i_2 \in I$, $i_1 \neq i_2$. Then,
$$
\begin{aligned}
\xi^{\{i_1,i_2\}}\left( X^{\{i_1,i_2\}}, Y^{\{i_1,i_2\}} \right)
&= t^{\{i_1,i_2\}}\left( X^{\{i_1,i_2\}}, Y^{\{i_1,i_2\}} \right) - T^{\{i_1,i_2\}}\left( X^{\{i_1,i_2\}}, Y^{\{i_1,i_2\}} \right) \\
&= t^{i_1}\left( x^{i_1}, y^{i_1} \right) + t^{i_2}\left( x^{i_2}, y^{i_2} \right)
 - T^{i_1}\left( X^{i_1}, Y^{i_1} \right) - T^{i_2}\left( X^{i_2}, Y^{i_2} \right) \\
&\le t^{i_1}\left( x^{i_1}, y^{i_1} \right) + t^{i_2}\left( x^{i_2}, y^{i_2} \right)
 - T^{i_1}\left( \lfloor x^{i_1} \rfloor, \lfloor y^{i_1} \rfloor \right) - T^{i_2}\left( \lfloor x^{i_2} \rfloor, \lfloor y^{i_2} \rfloor \right) \\
&= \left( t^{i_1}\left( \lfloor x^{i_1} \rfloor, \lfloor y^{i_1} \rfloor \right) - T^{i_1}\left( \lfloor x^{i_1} \rfloor, \lfloor y^{i_1} \rfloor \right) \right)
 + \left( t^{i_2}\left( \lfloor x^{i_2} \rfloor, \lfloor y^{i_2} \rfloor \right) - T^{i_2}\left( \lfloor x^{i_2} \rfloor, \lfloor y^{i_2} \rfloor \right) \right) \\
&\quad + \left( t^{i_1}\left( x^{i_1}, y^{i_1} \right) - t^{i_1}\left( \lfloor x^{i_1} \rfloor, \lfloor y^{i_1} \rfloor \right) \right)
 + \left( t^{i_2}\left( x^{i_2}, y^{i_2} \right) - t^{i_2}\left( \lfloor x^{i_2} \rfloor, \lfloor y^{i_2} \rfloor \right) \right) \\
&\le 2\sqrt{2}\, (|J| - 1)\, L + 2\sqrt{2}\, (|J| - 1)\, L + 2\sqrt{2}\, L \\
&\le 4\sqrt{2}\, |J|\, L
\end{aligned}
\tag{12}
$$
We can also bound the error in the case of three inside sensors and all of their
adjacent outside sensors:
$$
\begin{aligned}
\xi^{\{i_1,i_2,i_3\}}\left( X^{\{i_1,i_2,i_3\}}, Y^{\{i_1,i_2,i_3\}} \right)
&\le t^{\{i_1,i_2\}}\left( x^{\{i_1,i_2\}}, y^{\{i_1,i_2\}} \right) + t^{i_3}\left( x^{i_3}, y^{i_3} \right)
 - T^{\{i_1,i_2\}}\left( \lfloor x^{\{i_1,i_2\}} \rfloor, \lfloor y^{\{i_1,i_2\}} \rfloor \right)
 - T^{i_3}\left( \lfloor x^{i_3} \rfloor, \lfloor y^{i_3} \rfloor \right) \\
&= \left( t^{\{i_1,i_2\}}\left( \lfloor x^{\{i_1,i_2\}} \rfloor, \lfloor y^{\{i_1,i_2\}} \rfloor \right)
 - T^{\{i_1,i_2\}}\left( \lfloor x^{\{i_1,i_2\}} \rfloor, \lfloor y^{\{i_1,i_2\}} \rfloor \right) \right)
 + \left( t^{i_3}\left( \lfloor x^{i_3} \rfloor, \lfloor y^{i_3} \rfloor \right) - T^{i_3}\left( \lfloor x^{i_3} \rfloor, \lfloor y^{i_3} \rfloor \right) \right) \\
&\quad + \left( t^{\{i_1,i_2\}}\left( x^{\{i_1,i_2\}}, y^{\{i_1,i_2\}} \right)
 - t^{\{i_1,i_2\}}\left( \lfloor x^{\{i_1,i_2\}} \rfloor, \lfloor y^{\{i_1,i_2\}} \rfloor \right) \right)
 + \left( t^{i_3}\left( x^{i_3}, y^{i_3} \right) - t^{i_3}\left( \lfloor x^{i_3} \rfloor, \lfloor y^{i_3} \rfloor \right) \right) \\
&\le 4\sqrt{2}\, |J|\, L + 2\sqrt{2}\,(|J| - 1)\, L + 2\sqrt{2}\, L \\
&\le 6\sqrt{2}\, |J|\, L
\end{aligned}
\tag{13}
$$
Therefore,
=0
So far, our model assumed a fixed flow of dangerous material on each pathway,
and we have presented a method that would allow law enforcement officials to
use current information on attacker behavior to maximize the amount of captured
illegal or dangerous contraband. However, we can think of attackers as intelligent
adversaries who would adjust their strategy once they observe the changes in site
security. Therefore, the goal of a defensive strategy could be to make sure that no
path leading into the site has a violation detection rate that is unreasonably low. For
xi ≥ 0, ∀i ∈ I
yj ≥ 0, ∀j ∈ J
In order to solve this problem we can use a similar approach to the one discussed in
the previous section.
Steps 1 and 2
These are identical to their counterparts described in Sect. 4, with .Fj = 1 for every
outside sensor j .
Step 3
Once again, we denote by $T^{i,j}$ the resulting table of values generated at Step 2.
More specifically, we denote by $T^{i,j}(X_i, Y_j)$ the optimal detection value that can
be achieved by investing $(X_i, Y_j) \in \mathcal{X} \times \mathcal{Y}$ units of inner and outer
resources, respectively. Then we consider the case of two outside sensors $j_1, j_2$ that are backed
up by inner sensor $i$. We can merge $T^{i,j_1}$ and $T^{i,j_2}$ into a single solution according
to:
$$
T^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right) =
\left\{
\begin{aligned}
\max_{y^{i,j_1},\, y^{i,j_2}}\ & \min\left\{ T^{i,j_1}\left( X^{i,\{j_1,j_2\}}, y^{i,j_1} \right),\; T^{i,j_2}\left( X^{i,\{j_1,j_2\}}, y^{i,j_2} \right) \right\} \\
\text{s.t.}\ & y^{i,j_1} + y^{i,j_2} \le Y^{i,\{j_1,j_2\}} \\
& y^{i,j_1} \in \mathcal{Y} \\
& y^{i,j_2} \in \mathcal{Y}
\end{aligned}
\right.
\tag{16}
$$

where $T^{i,\{j_1,j_2\}}\left( X^{i,\{j_1,j_2\}}, Y^{i,\{j_1,j_2\}} \right)$ is the optimal detection value for the combined
system.
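Compared with the merge used for the expected-capture model, the only change in (16) is that the two table entries are combined with a minimum instead of a sum. A one-function illustrative sketch (index-based budget grids as before, not the authors' code):

```python
# A minimal sketch (illustration only) of the max-min merge in (16): the outer
# budget index b is split between j1 and j2 so as to maximize the smaller of
# the two detection values, for a fixed inner budget index a.
def merge_outer_tables_minmax(T1, T2, a, b):
    """max over b1 + b2 <= b of min(T1[a][b1], T2[a][b2])."""
    best = float("-inf")
    for b1 in range(b + 1):
        best = max(best, min(T1[a][b1], T2[a][b - b1]))
    return best
```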
8 Computational Results
In this section we present computational results for the methods developed above.
The experiments were performed on an AMD Phenom X4 9550 workstation with
6GB of DDR2 RAM. We consider two different system configurations, and for each
of them we provide plots of the objective function value for both the original and
adaptive adversary models.
Example 8.1 In this example we consider an inner layer consisting of four sensors,
one with three adjacent outside sensors (indices 1, 2, 3), a second with two adjacent
outside sensors (indices 4, 5), a third with two adjacent outside sensors (indices 6, 7),
and a fourth with two adjacent outside sensors (indices 8, 9). For each inside sensor
$i \in I$, we specify $D_i^x(x) = \min\{0.2x,\ 0.4 + 0.1x\}$. Further, for the first outside
sensor we use $D_{j_1}^y(y) = \min\{0.3y,\ 0.3 + 0.1y,\ 0.5 + 0.05y\}$, and for outside sensors
of index $j = 2, 3, \ldots, 9$, we use $D_j^y(y) = \min\{0.3y,\ 0.3 + 0.1y\}$. In addition, all
outside sensors have exactly 1 unit of incoming flow. Figure 9 gives the solution
maximizing the expected amount of captured contraband for a range of interior and
exterior budgets, i.e., the solution to the first problem. The solution matrix includes
10,302 distinct points and the computation took 117 seconds.

Fig. 9 Solution maximizing the expected amount of captured contraband for a range of interior
and exterior budgets for Example 8.1
We can also calculate the adaptive adversary solution maximizing the minimum
probability of capturing contraband along all paths for a range of interior and
exterior budgets. The solution shown in Fig. 10 includes 40,401 distinct points and
the computation took 3102 seconds (52 minutes).
Example 8.2 In this example, we modify Example 8.1 so that all the outside sensors
have exactly 1 unit of incoming flow except for outside sensors 1 and 9 which
have 10 units of incoming flow. Figure 11 shows the optimal objective values of
the maximized amount of captured contraband for a range of interior and exterior
budgets. The solution table includes 10,302 distinct points, and the computation
took 119 seconds.
In this example we only changed the flow values of Example 8.1. For this reason,
we do not need to compute a new adaptive adversary solution, as it would be
identical to the one for Example 8.1. Naturally, this illustrates the robustness of
the adaptive adversary formulation compared to its original counterpart.
Fig. 10 Adaptive adversary solution maximizing the minimum probability of capturing contra-
band along all paths for a range of interior and exterior budgets for Example 8.1
Fig. 11 Solution maximizing the expected amount of captured contraband for a range of interior
and exterior budgets for Example 8.2
9 Closing Remarks
We have considered the problem of determining the optimal resource allocation for
layered security. A computational method for the maximization of captured contra-
band and an adaptive adversary approach for the maximization of the worst case
probability of detection have been developed. Both methods are computationally
tractable and can be applied to non-trivial practical problems.
We have a great deal more that we can do in the future. One thing is to consider
both legal and illegal flow, which we also refer to as respectively good and bad
units. Hence, in addition to detecting bad units we could consider false positive
decisions for each sensor and adopt a risk-averse optimization approach [3]. Another
possible direction would be attempting to write the problem as a large game and use
approximation methods similar to the ones developed by Grigoriadis and Khachian
[10]. Alternatively, we could look into interdiction on planar graphs methods similar
to the ones developed by Zenklusen [18, 19].
Still another approach is to follow the applications of Stackelberg games that
have been used in pioneering defensive approaches at the nation’s airports, ports,
and in applications by the Federal Air Marshals Service, US Coast Guard, etc. (see
[11, 15]). In a Stackelberg game between an attacker and a defender, the defender
(security) acts first. The attacker can observe the defender’s strategy and choose
the most beneficial point of attack. The challenge is to introduce some randomness
in the defender’s strategy to increase the uncertainty on the part of the attacker.
Bayesian Stackelberg games do exactly that. Layered defense makes this into a new
kind of Stackelberg game to analyze, one with two rounds, one involving the outer
layer and one involving the inner layer based on results at the outer layer. We can
look both at nonrandomized and randomized strategies for the defender.
There are many other directions in which this work could go. Even with our
current model, we have not yet developed practical methods to handle more than
two layers of defense. There are also many variations on our model that could
be quite interesting. For example, we could consider a fixed resource limit that
the defender could allocate between inner and outer layers. Then, we could allow
adaptive redistribution of resources across layers and across time (see [2, 4]).
Disclaimer The views and conclusions contained in this document are those of the authors and
should not be interpreted as necessarily representing the official policies, either expressed or
implied, of the U.S. Department of Homeland Security.
Acknowledgments This material is based upon work supported by the U.S. Department of
Homeland Security under Grant Award Numbers 22STESE000010101 from DHS Office of
University Programs via Northeastern University, DHS 2009- ST-061-CCI002 from DHS Office
of University Programs to Rutgers University, and HSHQDC-13-X-00069 from DNDO. PBK
acknowledges support from the NSF Grants Numbered 1247696 and 2041428, and the Department
of Industrial and Systems Engineering, University of Wisconsin, Madison.
References
1. Anand, S., Madigan, D., Mammone, R., Pathak, S., Roberts, F.: Experimental analysis of
sequential decision making algorithms for port of entry inspection procedures. In: Intelligence
and Security Informatics, pp. 319–330. Springer (2006)
Developing and Running a Set of Competitive College/University Blended Courses
Alexander S. Belenky
1 Introduction
Distance learning is one of the rapidly growing directions in higher education, and,
according to numerous reports, only in the U.S., it currently embraces millions of
students [1]. Higher education specialists continue to discuss various philosophical
and pedagogical problems of distance learning, including those related to the quality
of education that it may provide (compared to a traditional off-line higher education)
[2]. At the same time, the distance learning economics becomes one of the areas
which starts to concern administrations of all the colleges and universities at which
the courses available via the Internet are or are planned to be in use [3].
One of the topics that is actively discussed in scientific publications in the frame-
work of the economics of distance learning deals with specifics of the so-called
blended courses [4–7]. A group of such courses is formed by those incorporating
recorded fragments from particular courses available on the Internet into lectures
and seminars that are taught by college/university professors in classrooms offline.
The available data on studies analyzing the effectiveness of various forms of dis-
tance learning in general suggest that from the viewpoint of the quality of education,
blended courses can be no less effective than recorded ones [8, 9]. Moreover, the
so-called “peer effect”—which is inevitably present in all blended courses—
apparently affects this quality positively [10, 11]. However, running a blended
course on a particular subject may require the college/university administration to
provide more professors and teaching assistants than it would provide for an offline
A. S. Belenky ()
Department of Mathematics, Faculty of Economic Sciences and International Centre of Decision
Choice and Analysis, The National Research University Higher School of Economics, Moscow,
Russia
e-mail: [email protected]
course on the same subject. Yet, in most cases, such an expansion of the teaching
personnel can be done only within financial boundaries of the college/university
budget.
Meeting desirable quality requirements within certain financial restrictions is
a typical decision-making problem that is solved in industrial, transportation,
agricultural, business, financial, and other systems (See, for instance [12–18].)
These systems have long been using decision support tools letting their users find
the best allocation of those financial resources that they can afford to spend in order
to be competitive in corresponding markets. It seems that the administration of every
college/university would benefit from having similar tools at its disposal to meet the
challenges that the education market of potential new students currently poses and
will pose in the future. This is the case since all the colleges/universities compete,
particularly, for new students interested in good higher education.
College/university administrations should bear in mind that, if properly advertised,
a commitment to offer blended courses based on recorded lectures and seminars of
professors from the most famous universities in the world is a factor that positively
affects student enrollment. Moreover, the strategic decision to provide blended courses
designed and advertised in this way is likely to positively affect the college/university’s
position in the education market in general, along with its budget, for at least two
reasons.
First, almost all the potential college/university new students who are interested
in acquiring knowledge (rather than only a diploma) dream of studying at Harvard,
MIT, Cambridge, Oxford, Princeton, Stanford, and other world-famous universities.
Yet, many of these students are unable to study there due to either low high
school grades or financial reasons (or both). If a particular college/university could
convince potential new students of the above-mentioned kind that they would listen
to the same lecturers on most of the courses as do students from these famous
universities, while being (a) offered effective tutorials prior to the start of every
course, and (b) given explanations of all the course nuances in a simple manner
understandable to them, this would tilt the enrollment decisions of these potential
new students in favor of this college/university.
Second, if all such potential college/university new students were offered to study
both these courses and tutorials at an affordable cost—which may, eventually, be
lower than the cost of studying the same subjects off line at other colleges/universi-
ties—their intent to enroll at this college/university would become even stronger.
Thus, the question that a college/university administration—interested in running
blended courses there in principle—needs to answer is: How to make the above
education strategy a reality, and how to convince potential new students to enroll at
this college/university?
Two sets of problems are to be addressed by the interested college/university to
answer this question.
The first set includes problems associated with finding such a structure of each
particular blended course (which is based on the above-mentioned recordings) that
would help the college/university administration convince the potential new students
that from the very first lectures, they would be confident of succeeding in studying the
course. (Certainly, the administration should make it clear that this success can come
only if the enrolled students (a) strictly follow recommendations of the teaching
personnel assigned to run this course, and (b) bend every effort to succeed.)
The second set includes problems associated with financial aspects of organizing
and running blended courses in a manner making it financially affordable to both
the college/university and the students.
Addressing problems from the first set (associated with choosing the structure of
each blended course) implies conducting comprehensive studies on how different
categories of students focus on both the subject of a lecture in general and particular
information or techniques discussed in the course. Also, the studies are to determine
the best way to run the lecture by choosing an optimal sequence of fragments from
the recorded courses and the explanations of the scope of these fragments that
should either precede a particular fragment or immediately follow it. These studies
should lead to designing testing questions to be asked by the teacher running the
course before going to the next fragment. However, all this should be done in a
manner that would not turn the lecture into a discussion with the audience that
may consume much of the lecture time. Finally, the structure of tutorials offered
by the college/university to let the students succeed in studying each blended course
designed in the chosen manner should be determined and announced in advance.
Published recommendations of the teachers, along with, possibly, even special
consultations of these teachers, can make a difference in the effectiveness of such
study results. Certainly, the experience and pedagogical skills of those to be chosen
to run blended courses matter a great deal. It’s especially so if every teacher chosen
to teach a blended course manages to test a particular version (or even different
particular versions) of this blended course that she/he proposes to run on groups of
the students similar (or at least close) to those who are expected to study the course
at the college/university.
Addressing problems from the second set implies the determination of
(a) the minimal budget to organize and run the teaching of a set of chosen blended
courses to secure such a percentage of the students (who are to study courses
from this set) expected to succeed in studying each particular course from the
set that would not be lower than a certain desirable one, and
(b) the maximal percentage of the students (who are to study blended courses from
this set of the chosen courses) expected to succeed in studying all the courses
from the set under a particular budget.
Though with respect to the first set of problems there is room (if not a
necessity) for the use of mathematical methods, these problems are currently
discussed in scientific publications on this subject mostly at the level of hypotheses
[19–21]. The authors of these publications either only formulate, or formulate and verify, the
proposed hypotheses by conducting polls among both the students and the teachers.
Problems from the second set are rarely considered in scientific publications at all,
and no helpful quantitative analysis of these problems has so far been offered.
The latter leaves college/university administrations “unarmed” in dealing with the
run particular blended courses, both those from abroad and from other universities
in the country, these problems don’t seem to have been studied quantitatively in
scientific publications. As far as the author is aware, only qualitative considerations
of the above two aspects of online education, both in general and with respect
to blended courses, have so far been presented either in the form of case studies
or in that of open discussions. As one can see from the publications cited in the
Introduction, papers covering at least five important aspects of education related to
blended learning—(a) online distance learning advantages in general, (b) blended
courses specifics, (c) the effectiveness of blended learning, (d) peer-effect in blended
learning, and (e) choosing the structure of blended learning—are those on only
qualitative aspects of the corresponding topics.
The aim of the review presented below is to give the reader an impression on
(a) the substance of the topics covered by contemporary publications on blended
learning, and (b) the state of affairs in this field, which has motivated the proposed
quantitative analysis of the problem of using available online courses in designing
blended ones.
Online Distance Learning Problems An important mission of online education—
which is close to the topic of the present paper—consists of providing opportunities
for bringing a high quality of education offered by the most prestigious col-
leges/universities in the world to colleges/universities of a (currently) modest level
of education quality. By reducing their expenses, the latter colleges/universities
may be able to make a higher level of education quality financially affordable
to both them and their students by widely using online courses developed by
leading colleges/universities in the world [3], particularly, by incorporating recorded
fragments from these online courses into blended courses that they can offer. To
implement this mission, as well as other social missions of online education, two
groups of problems associated with distance learning are to be studied.
The first group of these problems includes those associated with “distance learn-
ers” such as communications among online learners [2], approaches to structuring
assessments of studying parts of an online course, which affect student’s learning
strategies [22], creating appropriate social environments affecting the motivation
to learn [23] and encouraging the presence of the so-called “massive learning”
phenomenon [24], choosing the structure of instructions provided in all forms of
online learning [25], along with teacher-student relationships that affect student’s
achievements [26], the effectiveness of online learning in the form of discussions
vs. so-called “silent learning” [27], pedagogical aspects of the knowledge perception
by the learners and the behavior of participants of the online education, along with
micro- and macro-level environment surrounding this activity [28].
The second group of the problems includes those associated with teachers
and administrations of colleges/universities involved in running distance learning
courses such as preparing teachers for running online courses [29], encouraging
teachers to switch to teaching online [30] or/and to using online materials in blended
educational courses, particularly, by providing information on the effectiveness
of online education [31] and comparing this effectiveness with that of traditional
Patterns of the students’ evaluation of and experience with online, blended, and
face-to-face courses are compared in [47].
The Effectiveness of Blended Learning A case study related to implementing a
blended learning approach to developing a course is presented in [48], where
both students’ responses to the blended learning environment and thoughts of
the course author on the requirements that a successful blended course should
meet are discussed. An approach to integrating online and face-to-face learning
with blended learning, which was successfully implemented at Suleyman Demirel
University via a University Learning Management Systems (LMS) with respect to a
particular computer engineering course, is described in [9]. A set of bottlenecks
in contemporary higher education, which is described in [49], is considered by
the authors, particularly, from the viewpoint of the ability of blended courses
to contribute to effectively solving corresponding problems of managing higher
education.
Twenty research studies analyzed in the framework of a review, presented in
[50], let the review authors suggest that there exist two groups of main factors,
each affecting the solving of creative problems in a blended learning environment,
that should be considered. The first group includes four particular factors affecting
the solving process, whereas the remaining five factors, determining the blended
learning environment, form the second group. Results of an experiment with
the MAgAdI on-line system—an adaptive, integrated into the learning process
web environment, developed to support the learning processes in which several
knowledge fields, courses, and teachers get involved—are discussed in [8]. The
authors of [8] assert that the use of that on-line system helps successfully blend
traditional teaching methods with various learning environments.
Peer Effects in Learning Peer effects—a phenomenon associated with the influence
that interaction with peers may have on a person’s behavior in group
learning—have some general features. This is why these effects are currently
studied mostly in general, though they may have certain specifics in blended learning
and may affect how groups of learners (who take blended courses) should be composed
to achieve better academic results.
Fundamentals of peer effects in higher education, including the definition of
this phenomenon, are discussed in [10], where the authors assert that these effects
affect the economics of higher education, mostly since (according to these authors),
they “eliminate awkward anomalies in the institutional behavior of colleges and
universities and in the economic structure of higher education as an industry if they
exist.” Three interaction patterns of peer talks—a tutor-led question-and-answer
pattern, a cumulative-exploratory pattern, and a dispute-exploratory one—within
a particular educational environment in which third-year students are to train first-
year ones are identified in [51]. The authors of that paper believe that using the
identified patterns may improve the effectiveness of studies with respect to all the
subjects of learning.
The underlying mechanism of peer effects is studied in [11] with respect to a
group of first year students for which the academic abilities of different student
subgroups are measured and compared. According to the authors of that paper, (a)
male, minority, and low-ability students are affected by their peers the most, and (b)
the transfer of general knowledge within all the subgroups has more effect than that
of specific knowledge.
A theoretical model attempting to explain (and compare) different peer effects
in different fields of studies based upon researching the behavior of the first year
students in a middle-sized public university in Italy is presented in [52]. Peer
effects in students’ academic performance are studied in [53] based upon a set of
publications in the field, and one of the findings, presented in that paper, suggests
that high ability students have higher positive peer effect on other high ability
students at both school and college/university level. At the college/university level,
the findings suggest that the background of roommates affects students’ academic
performance.
Choosing the Structure of a Blended Course An approach to structuring a blended-
learning bachelor program in electrical engineering, designed for a group of students
who study this subject while working (i.e., being employed), is discussed in
[21]. Several options for information exchange such as establishing a connection
with the course of mathematics, retrieving references needed to study the course,
electronic module questionnaires, and a feedback channel functioning during the
whole course are used in structuring the course. A methodology (Q-methodology)
for designing blended education courses is considered in [19] to identify learning
perspectives of the students enrolled, along with peculiarities of their perceptions
of the course, which helps estimate possible student achievements as a result of
studying the course. The author of [19] identifies four perception types and offers
his observations on the academic achievements of the students with some of those
perception types.
A mathematical thinking approach underlying the structure of a blended course
covering the multivariable part of a calculus course is proposed in [54],
where the authors assert that creating a blended learning environment is well suited to
developing mathematical thinking in students. A set of guidelines based on
theories of instructional design for writing instructors, offered in [55], is aimed at
supporting effective student learning. The authors of that publication suggest that in
designing the course, the instructors should focus on five issues: (1) a substantiation
of the need for the course, (2) the structure of the audience to learn the course,
(3) advantages of developing the course in an on-line form, (4) basic pedagogical
principles, and (5) available resources.
A concept of blended learning, along with basic elements of its teaching mode
frame, is discussed in [56]. A set of guides, documents, and publications related to
practical aspects of designing blended courses, along with common principles of
designing such courses, an appropriate terminology, and strategies to use to meet
accreditation requirements, is considered in [57]. The authors of that publication
discuss approaches to using online technologies and to the course implementation,
along with the problems and difficulties to bear in mind in designing blended
courses, and outline directions of further research in this field. Detailed instructions
• .al be the basic yearly salary of teacher l to be invited from abroad to teach courses
from the set of K blended courses in English, .l ∈ 1, L,
• .lk be the additional yearly salary of teacher l, invited from abroad to teach
courses from the set of K blended courses in English, for teaching course .k, k ∈
1, K, l ∈ 1, L,
• .hl be the relocation cost associated with the invitation of teacher l from abroad
to teach courses from the set of K blended courses in English, .l ∈ 1, L,
• .di be the per hour salary of a (currently employed by the University) teacher, who
is to substitute course developer i (who is assigned to teach courses from the set
of K blended courses) in all the activities associated with teaching the courses
“vacated” by course developer i, .i ∈ 1, M (if there are such course developers),
• .tik be the number of hours per year that course developer i “vacates” (as a result
of switching to teaching course k from the set of K blended courses) that are
to be covered by other teachers (either by those who are currently employed at
the University or by those to be hired from other universities in the country),
.i ∈ 1, M, k ∈ 1, K,
• .xik be a Boolean variable that equals 1 if course developer i from the University is
assigned to teach blended course .k, i ∈ 1, M, k ∈ 1, K, and equals 0, otherwise,
• .sr be a Boolean variable that equals 1 if course developer r to be invited from
another university in the country is qualified to teach courses from the set of K
blended courses and equals 0, otherwise, .r ∈ 1, R,
• .yrk be a Boolean variable that equals 1 if course developer r from another
university in the country is invited to teach blended course .k, k ∈ 1, K, r ∈ 1, R
and equals 0, otherwise,
• .wl be a Boolean variable that equals 1 if teacher l from abroad is invited to teach
courses from the set of K blended courses and equals 0, otherwise, .l ∈ 1, L,
• .zlk be a Boolean variable that equals 1 if teacher l from abroad is invited to teach
blended course k and equals 0, otherwise, .l ∈ 1, L, k ∈ 1, K,
• .M(i) ⊂ 1, K be a subset of courses from the set of K blended courses that
course developer i (from the University) can’t teach, .i ∈ 1, M,
• .R(r) ⊂ 1, K be a subset of courses from the set of K blended courses that
course developer r (to be invited from another university in the country) can’t
teach, .r ∈ 1, R,
• .L(l) ⊂ 1, K be a subset of courses from the set of K blended courses that
teacher l (to be invited from abroad) can’t teach, .l ∈ 1, L,
• .νi be the maximal number of courses from the set of K blended courses that
course developer i (from the University) can teach concurrently, .i ∈ 1, M,
• .μr be the maximal number of courses from the set of K blended courses that
course developer r (to be invited from another university in the country) can
teach concurrently, .r ∈ 1, R,
• .πl be the number of courses from the set of K blended courses that teacher l (to
be invited from abroad) can teach concurrently, .l ∈ 1, L, and
• .qk be the yearly budget that the University can spend for additional salaries to
be paid to the teacher who teaches course k from the set of K blended courses
(along with the English language course and corresponding tutorials, if need be)
and to the one who substitutes this teacher (if there is one).
Assumptions
1. The numbers of potential candidates from other universities in the country to
choose from to (a) invite to become blended course developers, and (b) replace
teacher i for the hours “vacated” by this teacher are known large integers.
provided recorded materials on a selected group of its students to find out what
kinds of training courses and tutorials are to be run to help the students better
understand both the substance of the blended courses and the language to be
used in presenting (the designed) blended courses to the audience, which these
courses are prepared for.
$$\sum_{i=1}^{M} x_{ik} + \sum_{r=1}^{R} y_{rk} + \sum_{l=1}^{L} z_{lk} = 1, \quad k \in \overline{1,K},$$
$$\sum_{i=1}^{M} x_{ik} \le 1, \; k \in \overline{1,K}, \qquad \sum_{k=1}^{K} x_{ik} \le \nu_i, \; i \in \overline{1,M},$$
$$x_{ik} = 0, \quad k \in M(i) \subset \overline{1,K},$$
$$y_{rk} \le s_r \le \sum_{k=1}^{K} y_{rk}, \quad r \in \overline{1,R}, \; k \in \overline{1,K},$$
$$\sum_{r=1}^{R} y_{rk} \le 1, \; k \in \overline{1,K}, \qquad \sum_{k=1}^{K} y_{rk} \le \mu_r, \; r \in \overline{1,R},$$
$$y_{rk} = 0, \quad k \in R(r) \subset \overline{1,K},$$
$$\sum_{l=1}^{L} z_{lk} - u_k = 0, \quad k \in \overline{1,K}, \qquad (1)$$
$$z_{lk} \le w_l \le \sum_{k=1}^{K} z_{lk}, \quad l \in \overline{1,L}, \; k \in \overline{1,K},$$
$$\sum_{l=1}^{L} z_{lk} \le 1, \; k \in \overline{1,K}, \qquad \sum_{k=1}^{K} z_{lk} \le \pi_l, \; l \in \overline{1,L},$$
$$z_{lk} = 0, \quad k \in L(l) \subset \overline{1,K},$$
$$\sum_{l=1}^{L} \lambda_{lk} z_{lk} + \sum_{r=1}^{R} \delta_{rk} y_{rk} + \sum_{i=1}^{M} \nabla_{ik} x_{ik} + \sum_{i=1}^{M} d_i t_{ik} x_{ik} + f_k u_k \le q_k, \quad k \in \overline{1,K},$$
and
$$\sum_{l=1}^{L} w_l(a_l + h_l) + \sum_{l=1}^{L}\sum_{k=1}^{K} \lambda_{lk} z_{lk} + \sum_{r=1}^{R} s_r(b_r + g_r) + \sum_{r=1}^{R}\sum_{k=1}^{K} \delta_{rk} y_{rk} + \sum_{k=1}^{K}\sum_{i=1}^{M} \nabla_{ik} x_{ik} + \sum_{k=1}^{K}\sum_{i=1}^{M} d_i t_{ik} x_{ik} + \sum_{k=1}^{K} f_k u_k \le B - B_0 \qquad (2)$$
should hold for the first year in the set of T years (also, see concluding Remark 9).
Taking into account the system of constraints (1) and (2), the two problems stated
earlier in this section of the paper can mathematically be formulated as follows:
Problem 1 The problem consists of maximizing the minimum percentage of the
total number of students expected to succeed in studying a course from the set .1, K
(when these courses are considered to be equally important from the University’s
viewpoint), and this problem is formulated proceeding from the fixed yearly budget
B (for T consecutive years). This problem is the Boolean programming problem
$$\min_{k \in \overline{1,K}} \left[\, \sum_{i=1}^{M} \alpha_{ik} x_{ik} + \sum_{r=1}^{R} \beta_{rk} y_{rk} + \sum_{l=1}^{L} \gamma_{lk} z_{lk} \right] \to \max_{(x_{ik},\, y_{rk},\, z_{lk},\, u_k,\, s_r,\, w_l)} \qquad (3)$$
under the systems of constraints (1) and (2).
Problem 2 The problem consists of minimizing the total yearly cost of organizing and
running the set of K blended courses, provided that for each course the expected
percentage of the students succeeding in studying it is not lower than a desirable level
$\omega_k$. This problem is the Boolean programming problem
$$\sum_{l=1}^{L} w_l(a_l + h_l) + \sum_{l=1}^{L}\sum_{k=1}^{K} \lambda_{lk} z_{lk} + \sum_{r=1}^{R} s_r(b_r + g_r) + \sum_{r=1}^{R}\sum_{k=1}^{K} \delta_{rk} y_{rk} + \sum_{k=1}^{K}\sum_{i=1}^{M} \nabla_{ik} x_{ik} + \sum_{k=1}^{K}\sum_{i=1}^{M} d_i t_{ik} x_{ik} + \sum_{k=1}^{K} f_k u_k \to \min_{(x_{ik},\, y_{rk},\, z_{lk},\, u_k,\, s_r,\, w_l)} \qquad (4)$$
under the system of constraints (1) and the additional system of constraints
$$\sum_{i=1}^{M} \alpha_{ik} x_{ik} + \sum_{r=1}^{R} \beta_{rk} y_{rk} + \sum_{l=1}^{L} \gamma_{lk} z_{lk} \ge \omega_k, \quad k \in \overline{1,K}. \qquad (5)$$
In both problems, it is assumed that the systems of constraints (1), (2) (in
Problem 1) and (1), (2), (5) (in Problem 2) are compatible, which can be verified,
and appropriate corrections in the systems of constraints (1), (2) can be made with
the use of the technique proposed, in particular, in [59].
In Problems 1 and 2, it is assumed that the parameters $\alpha_{ik}$, $\beta_{rk}$, and $\gamma_{lk}$—which
are expert estimates of the percentages of the students who are to take course $k \in \overline{1,K}$ and
are expected to succeed in studying this course if the course is taught by a developer
from the University, or by a developer from another university in the country, or
by a teacher invited from abroad, respectively—are known numbers provided to the
University administration by experts recognized by the University.
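To make the structure of Problem 1 concrete, the following is a minimal sketch of how the max-min objective (3) together with a subset of the constraints (1) and (2) could be coded with the open-source PuLP modeller in Python. All numerical data are synthetic placeholders; several constraint groups of (1) (the forbidden-course sets M(i), R(r), L(l), the load limits ν_i, μ_r, π_l, the per-course budgets q_k, and the substitution and tutorial cost terms d_i t_ik x_ik and f_k u_k) are omitted for brevity, so this is only an illustration of the modelling pattern, not the author's implementation.

```python
import pulp

M, R, L, K = 2, 2, 2, 3                       # developers, invitees, foreign teachers, courses
alpha = [[60, 55, 50], [58, 62, 49]]          # expected success %, course developer i
beta  = [[57, 53, 51], [54, 56, 52]]          # expected success %, invitee r
gamma = [[70, 68, 66], [65, 71, 69]]          # expected success %, teacher l from abroad
a = [90, 95]; h = [20, 25]                    # basic salary a_l and relocation cost h_l
b = [40, 42]; g = [5, 6]                      # analogous costs b_r, g_r as in (2)
lam   = [[10, 11, 12], [9, 10, 11]]           # per-course salary of teacher l (lambda_lk)
delta = [[6, 7, 8], [5, 6, 7]]                # per-course salary of invitee r (delta_rk)
nabla = [[3, 4, 5], [2, 3, 4]]                # per-course extra salary of developer i
budget = 300                                  # placeholder for B - B0

prob = pulp.LpProblem("Problem_1", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", (range(M), range(K)), cat="Binary")
y = pulp.LpVariable.dicts("y", (range(R), range(K)), cat="Binary")
z = pulp.LpVariable.dicts("z", (range(L), range(K)), cat="Binary")
s = pulp.LpVariable.dicts("s", range(R), cat="Binary")
w = pulp.LpVariable.dicts("w", range(L), cat="Binary")
t = pulp.LpVariable("t")                      # auxiliary variable linearizing the max-min objective (3)
prob += t

for k in range(K):
    # every course is taught by exactly one teacher (first constraint of (1))
    prob += (pulp.lpSum(x[i][k] for i in range(M)) +
             pulp.lpSum(y[r][k] for r in range(R)) +
             pulp.lpSum(z[l][k] for l in range(L)) == 1)
    # t is a lower bound on the expected success percentage of every course k
    prob += t <= (pulp.lpSum(alpha[i][k] * x[i][k] for i in range(M)) +
                  pulp.lpSum(beta[r][k] * y[r][k] for r in range(R)) +
                  pulp.lpSum(gamma[l][k] * z[l][k] for l in range(L)))
for r in range(R):
    for k in range(K):
        prob += y[r][k] <= s[r]               # an invitee teaches only if actually invited
for l in range(L):
    for k in range(K):
        prob += z[l][k] <= w[l]               # a foreign teacher teaches only if actually invited

# total yearly budget constraint in the spirit of (2)
prob += (pulp.lpSum(w[l] * (a[l] + h[l]) for l in range(L)) +
         pulp.lpSum(lam[l][k] * z[l][k] for l in range(L) for k in range(K)) +
         pulp.lpSum(s[r] * (b[r] + g[r]) for r in range(R)) +
         pulp.lpSum(delta[r][k] * y[r][k] for r in range(R) for k in range(K)) +
         pulp.lpSum(nabla[i][k] * x[i][k] for i in range(M) for k in range(K))) <= budget

prob.solve()
print("minimum expected success percentage:", pulp.value(t))
```

The same pattern extends directly to Problem 2: the auxiliary variable t is dropped, the cost expression of (4) becomes the objective to minimize, and the inequalities (5) with right-hand sides $\omega_k$ are added as constraints.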
4 Concluding Remarks
1. It’s clear that, generally, Boolean programming problems (1)–(3) and (1), (2),
(4), (5) can also serve as mathematical formulations of problems associated with
finding an optimal composition of the teachers to run any set of offline courses at
a college/university, for instance, in physics, mathematics, biology, etc., without
the use of recordings of any lectures on corresponding subjects. If this is the
case, any training courses and tutorials in the corresponding subjects may or
may not be run, and teachers from abroad may or may not be invited to teach the
corresponding offline courses. To formulate both problems in this case, some
variables and parameters present in the system of constraints (1), (2)
should be omitted.
Such a possibility to use the proposed models and Boolean programming
problems stems from an obvious observation: the nature of staffing a set of
blended courses with teachers and that of staffing a set of offline courses with
them is the same, and so should be their mathematical formalization. Indeed, in
both cases, an optimal combination of the teachers from any available set of them
to teach courses within certain financial limits should be determined. However,
staffing offline courses at, for instance, a modest university differs from staffing
blended ones there.
Indeed, to make blended courses attractive to potential new students, the
college/university administration should
(a) properly advertise these courses as those to be taught with the use of
fragments of lectures given by professors from famous universities in the
world,
(b) convince the potential new students that both the tutorials and the training
courses to accompany each offered blended course (if there are ones to
accompany it) are designed to let every interested student succeed in studying
this course,
(c) pay extra salaries to the chosen teachers for developing each blended
course (in line with its structure to be provided by the college/university
administration), and
(d) apply a reliable procedure of estimating the ability of each potential teacher
to teach a blended course
to let the expert estimates of the percentage of the students expected to succeed
in studying this course be not lower than a particular desirable number (which is
known for standard offline courses on the same subjects).
Finally, the blended courses to be included in the college/university curricula
should be designed and run in such a manner that the students who are the
most successful in studying the offered blended courses may eventually become
prepared to continue their education in those world-famous universities, whose
recorded fragments of lectures were part of these blended courses. For instance,
upon graduation, bachelor students may succeed in getting admitted to these
universities for master programs, whereas master students may get admitted
for Ph.D. studies there. Needless to say, if this happened, it would be an
excellent contribution to the university's prestige and would sharply contrast
the described strategy of designing and running blended courses with that
of running traditional offline ones. The college/university may even organize
subgroups of the most advanced enrolled students interested in making these
moves to the world-leading universities to help them succeed in implementing
their desire. This can be done, particularly, by offering them specially designed
blended courses with a more intensive use of the recorded online lectures given
by distinguished professors at those universities.
2. While designing blended courses based on fragments of recorded lectures of
distinguished lecturers from world-leading universities is a good option for
advanced potential and current college/university students, blended courses for
both less prepared potential and current students may use recordings of ordinary
lecturers, including those offered by college/university teachers. In both cases,
problems (1)–(3) and (1), (2), (4), (5) (with changes reflecting a particular
strategy of designing and running blended courses) can be used for receiving
corresponding quantitative financial estimates.
3. The formulations of Problems 1 and 2 suggest that the proposed decision support
tool may help a college/university to substantiate the budget needs in talks with
legal entities that care about the quality of education that this college/university
provides and its prestige (such as local and federal administrations for public
colleges/universities and contributing sponsors for private ones). Particularly,
this tool helps the college/university substantiate the size of B − B0—the
college/university's budget that is planned by the college/university to be spent
for organizing and running K blended courses—particularly, in the talks with
these legal entities associated with the competitiveness of the college/university
in the market of new potential students.
4. In this paper, the competitiveness of a college/university in a market of potential
students is considered with respect to new potential students only. However, it’s
clear that, generally, the competitiveness of a college/university also depends
on how the education process is arranged for the students who have been with
this college/university at least for some time. Appropriate arrangements motivate
such students to complete their education where they are currently enrolled
rather than encourage them to transfer to another college/university to receive a
corresponding degree there. With respect to blended courses, which can be either
core ones or electives, the same features that attract the attention of potential
new students should be present in the blended courses offered to already enrolled
college/university students. Also, one should bear in mind that the quality of
teaching these courses affects the intents of new potential students to enroll at
the college/university, since such intents are partly motivated by the information
on this quality received from its current students.
5. The proposed scheme of using fragments from recorded online courses is only
one of the options to use such recordings. Since most of these courses are
available free of charge, the college/university administration may decide to use
online courses as a substitution for offline ones, without designing and running
new (blended) courses. From the author’s teaching experience, it seems hard to
believe that recorded online lectures can always substitute for the live presence of a
lecturer in the auditorium for a majority of the students studying these courses,
while (at the very least) such a substitution is unlikely to positively affect the
percentage of the students who are likely to study the corresponding online courses
successfully. In any case, it looks reasonable to “test” both approaches to using the
recorded online courses before making a final decision on this matter.
6. Organizing tutorials to prepare particular groups of college/university students
for successfully studying new blended courses is always a challenge. The major
problem here is associated with timely preparing the students to (a) understand
the facts to be presented in a particular blended course, and (b) be comfortable
with the style of the material presentation by the teacher chosen to run this
blended course. The time left for this preparation much depends on when the
selection of the teachers to run the courses is completed, and the results of
providing corresponding tutorials substantially depend on who runs the tutorials
for each new course from the set of K blended courses. The latter may become a
complicated problem when the teacher chosen to run a particular blended course
is not currently employed at the college/university.
Also, one needs to emphasize that a teacher chosen to teach a blended
course at a college/university is free to develop this course on her/his own,
as long as she/he adheres to the structure of the course determined by the
college/university administration. (However, discussing approaches to developing
particular blended courses in line with their assigned structures lies beyond the
scope of this paper.)
Finally, in the model (1), (2), it was assumed that the tutorials and English
language courses are organized and run only for blended courses to be taught by
the teachers invited from abroad. However, generally, this may not be the case.
Both kinds of course developers (from the University and from other universities
in the country) may request that both tutorials for the blended courses taught by
them and English language courses offered by the corresponding departments at
the University be run.
Let $\tilde M \subset \overline{1,M}$ and $\tilde R \subset \overline{1,R}$ be subsets of those course developers who request
that tutorials be organized and run (with or without additional English language
courses to be organized by the college/university), let $\tilde f^{\,U}_{ik}$ and $\tilde f^{\,OU}_{rk}$ be the
costs of running these tutorials, and let $\tilde\psi^{\,U}_{ik}$ and $\tilde\psi^{\,OU}_{rk}$ be the costs of running
additional English language courses, requested by the developers from the
college/university and by those from other colleges/universities in the country,
respectively. Then, for instance, the first (of the T) inequalities in (2) will take
the form
$$\sum_{l=1}^{L} w_l(a_l + h_l) + \sum_{l=1}^{L}\sum_{k=1}^{K} \lambda_{lk} z_{lk} + \sum_{r=1}^{R} s_r(b_r + g_r) + \sum_{r=1}^{R}\sum_{k=1}^{K} \delta_{rk} y_{rk} + \sum_{k=1}^{K}\sum_{i=1}^{M} \nabla_{ik} x_{ik} + \sum_{k=1}^{K}\sum_{i=1}^{M} d_i t_{ik} x_{ik} + \sum_{k=1}^{K} f_k u_k + \sum_{i \in \tilde M}\sum_{k=1}^{K} \tilde f^{\,U}_{ik} x_{ik} + \sum_{r \in \tilde R}\sum_{k=1}^{K} \tilde f^{\,OU}_{rk} y_{rk} + \sum_{i \in \tilde M}\sum_{k=1}^{K} \tilde\psi^{\,U}_{ik} x_{ik} + \sum_{r \in \tilde R}\sum_{k=1}^{K} \tilde\psi^{\,OU}_{rk} y_{rk} \le B - B_0. \qquad (6)$$
Here, it is assumed that if developer $i^*$ from the set $\tilde M$ does not request that
tutorials be organized and run for blended course $k^* \in \overline{1,K}$, the coefficient $\tilde f^{\,U}_{i^*k^*}$
equals zero. The same assumption holds for the coefficient $\tilde f^{\,OU}_{r^{**}k^{**}}$ (if developer
$r^{**}$ from the set $\tilde R$ does not request that tutorials be organized and run for blended course
$k^{**} \in \overline{1,K}$) and for the coefficients $\tilde\psi^{\,U}_{i^*k^*}$ and $\tilde\psi^{\,OU}_{r^{**}k^{**}}$.
7. Generally, the college/university administration may exercise two approaches to
using recorded fragments of online lectures in designing blended courses, which
differ in the budget needed for running these courses. The first one is to offer the same set
of blended courses on a yearly basis within, say, T years so that no new blended
courses are added within these T years (see Assumption 5 from Sect. 3). The
other one implies widening the spectrum of blended courses yearly (or even more
often) depending on the results at the end of a particular year (or even on those of
each year), including financial results associated with running blended courses.
One can easily verify that the proposed model can be used in formalizing
corresponding mathematical problems in the framework of the second approach,
though the structure of the system of constraints in the corresponding problems
will be different. That is, this system will have a block-diagonal structure binding
variables related to each particular year (during T years) within a separate block,
along with a block binding all the variables of the system of constraints [60].
8. The proposed tool demonstrates the potential of an adequate mathematical
modelling, systems analysis, operations research techniques, and standard opti-
mization software in solving practical problems arising in the economics of
education.
9. One should bear in mind that if inequality (2) holds for the first year in the set
of T years, the restriction for a yearly budget for all the K courses will also
hold for all the other years from this set (since all the expenses associated with
the relocation of the invited teachers, which are reflected in (2), as well as the
expenses B0 , take place in this first year).
10. In the models underlying the formulations of Problems 1 and 2, it’s assumed
that both the set of potential course developers from other colleges/universities
in the country and that from those to be invited from abroad are known to
the college/university administration in advance, before these two problems
are solved. Also, it’s assumed that both the tutorials (for each special course)
and the English language courses, provided by the teachers, are taken by the
students during the period of T years only once. Finally, it’s assumed that any
substitution of the teachers from other colleges/universities in the country and
from abroad for the years from the set of T years can be done only by the
college/university teachers (unless the relocation expenses of an invited teacher,
not currently employed by the college/university, are covered by it), and the
values ci , i ∈ 1, M, are taken into account.
One of the further research directions for studying both problems considered in this
paper concerns the development of approaches to solving these problems
under uncertainty. This uncertainty is associated with the impossibility of determining,
in principle, the exact values of the expert estimates of the parameters $\alpha_{ik}$, $\beta_{rk}$, $\gamma_{lk}$—the
percentages of the students expected to succeed in studying each of the K blended
courses with respect to every teacher considered by the college/university admin-
istration as a candidate for teaching this course. Though the proposed
tool lets the college/university administration conduct multiple calculations for any
number of sets of these expert estimates that the administration may be interested to
explore, one should understand that information on the values of these estimates is
of a probabilistic nature. Moreover, establishing any particular regularities and
parameters of distribution laws describing the dynamics of these parameters seems
to present considerable difficulties.
At the same time, there exist approaches to treating similar information (on
parameter values in corresponding mathematical models) that allow problems
under uncertainty similar to those considered in this paper to be formulated as linear
or mixed programming problems, which can be solved using standard software
packages available even on PCs [61]. However, the applicability of such approaches
to the considered problems depends on whether assertions similar to those presented in
[61] can be mathematically proven.
References
1. Kolowich, S.: Exactly how many students take online courses. The Chronicle of Higher
Education, January 16 (2014)
2. Joksimovic, S., Gasevic, D., Loughin, T.M., Kovanovic, V., Hatala, M.: Learning at distance:
effects of interaction traces on academic achievement. Comput. Educ. 87, 204–217 (2015)
3. Carnoy, M., Kuz’minov, Y.: Online learning: how it affects the university structure and
economics. Panel discussion. Educ. Stud. Moscow, 3, 8–43 (2015)
4. Asarta, C.J., Schmidt, J.R.: Comparing student performance in blended and traditional courses:
does prior academic achievement matter? Internet High. Educ. 32, 29–38 (2017)
5. Zhu, Y., Au, W., Yates, G.: University students’ self-control and self-regulated learning in a
blended course. Internet High. Educ. 30, 54–62 (2016)
6. Zacharis, N.Z.: A multivariate approach to predicting student outcomes in web-enabled
blended learning courses. Internet High. Educ. 27, 44–53 (2015)
7. Olitsky, N.H., Cosgrove, S.B.: The effect of blended courses on student learning: evidence
from introductory economics courses. Int. Rev. Econ. Educ. 15, 17–31 (2015)
8. Alvarez, A., Martin, M., Fernandez-Castro, I., Urretavizcaya, M.: Blending traditional teaching
methods with learning environments: Experience, cyclical evaluation process and impact with
MAgAdI. Comput. Educ. 68, 129–140 (2013)
9. Yigit, T., Koyun, A., Yuksel, A.S., Cankaya, I.A.: Evaluation of blended learning approach in
computer engineering education. Procedia Soc. Behav. Sci. 141, 807–812 (2014)
10. Winston, G.C., Zimmerman, D.J.: Peer effects in higher education. In: The Economics of
Where to Go, When to go, and How to Pay for it, pp. 395–421. University of Chicago Press
(2004)
11. Griffith, A.L., Rask, K.N.: Peer effects in higher education: a look at heterogeneous impacts.
Econ. Educ. Rev. 39, 65–77 (2014)
12. Bazerman, M., Moore, D.: Judgment in Managerial Decision Making, 8th edn. Wiley (2012)
13. Gilboa, I.: Making Better Decisions: Decision Theory in Practice, 1st edn. (Wiley/Blackwell,
2010)
14. Saaty, T.: Decision Making for Leaders: The Analytic Hierarchy Process for Decisions in a
Complex World, 3rd Revised edn. RWS Publications (2012)
15. Noyes, J., Cook, M.: Decision Making in Complex Environments, 1st edn. CRC Press (2017)
16. de Haan, A., de Heer, P.: Solving Complex Problems: Professional Group Decision-Making
Support in Highly Complex Situations, 2nd Revised edn. Eleven International Publishing
(2015)
17. Hammond, J., Keeney, R., Raiffa, H.: Smart Choices: A Practical Guide to Making Better
Decisions. Crown Business (2002)
18. Adair, J.: Decision Making and Problem Solving (Creating Success). Kogan Page (2013)
19. Kim, J.Y.: A study on learners’ perceptional typology and relationships among the learner’s
types, characteristics, and academic achievement in a blended e-education environment.
Comput. Educ. 59(2), 304–315 (2012)
20. Linder, K.E.: The Blended Course Design Workbook: A Practical Guide, Workbook edn.
Stylus Publishing (2016)
21. Kalberer, N., Petendra, B., Bohmer, C., Schibelbein, A., Beck-Meuth, E.M.: Evaluation process
and quality management in a blended-learning bachelor’s programme. Procedia Soc. Behav.
Sci. 228, 131–137 (2016)
22. Zlatovic, M., Balaban, I., Kermek, D.: Using online assessments to stimulate learning strategies
and achievement of learning goals. Comput. Educ. 91, 32–45 (2015)
23. Barak, M., Watted, A., Haick, H.: Motivation to learn in massive open online courses:
examining aspects of language and social engagement. Comput. Educ. 94(3), 49–80 (2016)
24. Sendra-Portero, F., Torales-Chaparro, O.E., Ruiz-Gomez, M.J., Martinez-Morillo, M.: A pilot
study to evaluate the use of virtual lectures for undergraduate radiology teaching. Eur. J. Radiol.
82(5), 888–893 (2013)
25. Yilmaz, O.: The effects of “live online courses” on students’ achievement at distance education.
Procedia Soc. Behav. Sci. 55(5), 347–354 (2012)
26. Song, H., Kim, J., Luo, W.: Teacher-student relationship in online classes: a role of teacher
self-disclosure. Comput. Hum. Behav. 54, 436–443 (2016)
27. Shukor, N.A., Tasir, Z., der Meijden, H.V.: An examination of online learning effectiveness
using data mining. Procedia Soc. Behav. Sci. 172, 555–562 (2015)
28. Cho, Y.H., Choi, H., Shin, J., Yu, H.C., Kim, Y.K., Kim, J.Y.: Review of research on online
learning environments in higher education. Procedia Soc. Behav. Sci. 191(2), 2012–2017
(2015)
29. Beach, P.: Self-directed online learning: a theoretical model for understanding elementary
teachers’ online learning experiences. Teach. Teacher Educ. 61, 60–72 (2017)
30. Rienties, B., Brouwer, N.Z., Lygo-Baker, S.: The effects of online professional development on
higher education teachers’ beliefs and intentions towards learning facilitation and technology.
Teach. Teacher Educ. 29(1), 122–133 (2013)
31. Karaman, S., Kucuk, S., Aydemir, M.: Evaluation of an online continuing education program
from the perspective of new graduate nurses. Nurse Educ. Today 34(5), 836–841 (2014)
32. Abdelaziz, M., Kamel, S.S., Karam, O., Abdelrahman, A.: Evaluation of e-learning program
versus traditional lecture instruction for undergraduate nursing students in a faculty of nursing.
Teach. Learn. Nurs. 6(2), 50–58 (2011)
33. Pei, L., Wu, H.: Does online learning work better than offline learning in undergraduate medical
education? A systematic review and meta-analysis. Med. Educ. Online 24(1), 1–13 (2019)
34. Mubarak, A., Al-Arimi, A.K.: Distance learning. Procedia Soc. Behav. Sci. 152, 82–88 (2014)
35. Tsiotakis, P., Jimoyiannis, A.: Critical factors towards analyzing teachers’ presence in on-line
learning communities. Internet High. Educ. 28(1), 45–58 (2016)
36. van Rooij, S.W., Zirkle, K.: Balancing pedagogy, student readiness and accessibility: a case
study in collaborative online course development. Internet High. Educ. 28(1), 1–7 (2016)
37. Markova, T., Glazkova, I., Zaborova, E.: Quality issues of online distance learning. Procedia
Soc. Behav. Sci. 237, 685–691 (2017)
38. Cohen, A., Baruth, O.: Personality, learning, and satisfaction in fully online academic courses.
Comput. Hum. Behav. 72, 1–12 (2017)
39. Boling, E.C., Hough, M., Krinsky, H., Saleem, H., Stevens, M.: Cutting the distance in distance
education: perspectives on what promotes positive, online learning experiences. Internet High.
Educ. 15(2), 118–126 (2012)
40. Moore, J.L., Dickson-Deane, C., Galyen, K.: E-learning, online learning, and distance learning
environments: are they the same? Internet High. Educ. 14(2), 129–135 (2011)
41. Sanders-Smith, S., Smith-Bonahue, T., Soutullo, O.: Practicing teachers’ responses to case
method of instruction in an online graduate course. Teach. Teacher Educ. 54(2), 1–11 (2016)
42. Vo, H.M., Zhu, C., Diep, N.A.: The effect of blended learning on student performance at
course-level in higher education: a meta-analysis. Stud. Educ. Eval. 53, 17–28 (2017)
43. Ocak, M.A.: Why are faculty members not teaching blended courses? Insights from faculty
members. Comput. Educ. 56(3), 689–693 (2011)
44. Lust, G., Elen, J., Clarebout, G.: Regulation of tool-use within a blended course: Student
differences and performance effects. Comput. Educ. 60(1), 385–395 (2013)
45. Ellis, R.A., Pardo, A., Han, F.: Quality in blended learning environments - significant
differences in how students approach learning collaborations. Comput. Educ. 102, 90–102
(2016)
46. Boelens, R., De Wever, B., Voet, M.: Four key challenges to the design of blended learning: a
systematic literature review. Educ. Res. Rev. 22, 1–18 (2017)
47. Dziuban, C., Moskal, P.: A course is a course is a course: factor invariance in student evaluation
of online, blended and face-to-face learning environments. Internet High. Educ. 14(4), 236–
241 (2011)
48. Nazarenko, A.L.: Blended learning vs traditional learning: what works? (a case study research).
Procedia Soc. Behav. Sci. 200, 77–82 (2015)
49. Slechtova P., Hana, V., Jan, V.: Blended learning: promising strategic alternative in higher
education. Procedia Soc. Behav. Sci. 171, 1245–1254 (2015)
50. Sophonhiranrak, S., Suwannatthachote, P., Ngudgratoke, S.: Factors affecting creative problem
solving in the blended learning environment: a review of the literature. Procedia Soc. Behav.
Sci. 174, 2130–2136 (2015)
51. Havnes, A., Christiansen, B., Bjork, I.T., Hessevaagbakke, E.: Peer learning in higher educa-
tion: patterns of talk and interaction in skills centre simulation. Learn. Cult. Soc. Interaction 8,
75–87 (2016)
52. Brunello, G., De Paola, M., Scoppa, V.: Residential peer effects in higher education: does the
field of study matter? Econ. Enquiry 48(3), 621–634 (2010)
53. Sacerdote, B.: Peer effects in education: How might they work, how big are they, and how
much do we know thus far? In: Handbook of the Economics of Education, vol. 3, pp. 249–277
(2010)
54. Kashefi, H., Ismail, Z., Yusof, Y.M.: Overcoming students obstacles in multivariable calculus
through blended learning: a mathematical thinking approach. Procedia Soc. Behav. Sci. 56,
579–586 (2012)
55. Savenye, W.C., Olina, Z., Niemczyk, M.: So you are going to be an online writing instructor:
issues in designing, developing, and delivering an online course. Comput. Compos. 18(4),
371–385 (2001)
56. Mo, Z.: Research on the teaching model application tendency of blended learning in the
displayed design course. IERI Procedia 2, 204–208 (2012)
57. McGee, P., Reis, A.: Blended course design: a synthesis of best practices. J. Asynchronous
Learn. Netw. 16(4), 7–22 (2012)
58. Vai, M., Sosulski, K.: Essentials of Online Course Design: A Standards-Based Guide (Essen-
tials of Online Learning), 2nd edn. Routledge (2015)
59. Belenky, A.: Analyzing the potential of a firm: an operations research approach, Autom.
Remote Control 35(13), 1405–1424 (2002)
60. Hu, T.C., Kahng, A.B.: Linear and Integer Programming Made Easy, 1st edn. Springer (2016)
61. Belenky, A.: Two classes of games on polyhedral sets in systems economic studies. In: Network
Models in Economics and Finance, pp. 35–80. Springer (2014)
SARAH-Based Variance-Reduced
Algorithm for Stochastic Finite-Sum
Cocoercive Variational Inequalities
1 Introduction
$$\min_{z \in \mathbb{R}^d} f(z).$$
To represent it in the form (1), it is sufficient to take $F(z) = \nabla f(z)$. As another
popular practical example, we can consider a saddle point (min-max) problem:
$$\min_{x} \max_{y}\; g(x, y).$$
Here we need to take $F(z) = [\nabla_x g(x, y), -\nabla_y g(x, y)]$.
A. Beznosikov
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
HSE University, Moscow, Russia
A. Gasnikov ()
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
IITP RAS, Moscow, Russia
Caucasus Mathematical Center, Adyghe State University, Maikop, Russia
e-mail: [email protected]
$$F(z) = \frac{1}{n}\sum_{i=1}^{n} F_i(z). \qquad (2)$$
In the case of minimization problems, statements of the form (1) + (2) are also called
empirical risk minimization [37]. These types of problems arise both in classical
machine learning problems such as simple regressions and in complex, large-scale
problems such as neural networks [23]. When it comes to saddle point problems, in
recent times the so-called adversarial approach has become popular. Here one can
highlight Generative Adversarial Networks (GANs) [13] and the adversarial training
of models [25, 40].
Based on the examples mentioned above, it can be noted that for operators of the
form (2), computing the full value of F is a possible but not desirable operation,
since it is typically very expensive compared to computing a single operator .Fi .
Therefore, when constructing an algorithm for the problem (1) + (2), one wants to
avoid computing (or compute very rarely) the full F operator. This task can be
solved by a stochastic gradient descent (SGD) framework. Currently, stochastic
methods for minimization problems already have a huge background [14]. The first
methods of this type were proposed back in the 1950s by Robbins and Monro [35].
For example, in the most classic variant, SGD could be written as follows:
$$z^{k+1} = z^k - \eta v^k, \qquad (3)$$
[n] is picked at random and the point .z̃ is updated very rarely (hence we do not
need to compute the full gradient often). With this type of methods it is possible to
achieve a linear convergence to the solution. But for both convex and non-convex
Recall that we consider the problem (1), where the operator F has the form (2).
Additionally, we assume
Assumption 1 (Cocoercivity) Each operator $F_i$ is $\ell$-cocoercive, i.e. for all $u, v \in \mathbb{R}^d$ we have
$$\|F_i(u) - F_i(v)\|^2 \le \ell\,\langle F_i(u) - F_i(v), u - v\rangle. \qquad (4)$$
Assumption 2 (Strong Monotonicity) The operator F is $\mu$-strongly monotone, i.e. for all $u, v \in \mathbb{R}^d$ we have
$$\langle F(u) - F(v), u - v\rangle \ge \mu \|u - v\|^2. \qquad (5)$$
For minimization problems this property means strong convexity, and for saddle
point problems strong convexity–strong concavity.
3 Main Part
Due to the fact that we consider cocoercive VIs, it is sufficient to look at SGD-like methods for
this class of problems. For example, [26] considers SGD, and [6] considers SVRG. Following
this reasoning, we base our method on the original SARAH [29].
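Algorithm 1 is not reproduced in this excerpt, so the following is only a minimal Python sketch of a SARAH-type update for the finite-sum operator (2), written from the recursion $v^k = v^{k-1} + F_{i_k}(z^k) - F_{i_k}(z^{k-1})$ used in the analysis below; the function name sarah_vi and the restart convention are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sarah_vi(F_i, n, z0, gamma, K, S, rng=None):
    """Sketch of a SARAH-type method for F(z) = (1/n) * sum_i F_i(z).

    F_i(i, z) must return the value of the i-th operator at the point z.
    gamma is the step size, K the inner-loop length, S the number of restarts.
    """
    rng = np.random.default_rng(rng)
    z_tilde = np.asarray(z0, dtype=float)
    for _ in range(S):                        # outer loop (restarts)
        z_prev = z_tilde.copy()
        # full operator at the restart point: v^0 = F(z^0)
        v = np.mean([F_i(i, z_prev) for i in range(n)], axis=0)
        z = z_prev - gamma * v                # z^1 = z^0 - gamma * v^0
        for _ in range(1, K):                 # inner loop: recursive estimate of F(z^k)
            i_k = rng.integers(n)
            v = v + F_i(i_k, z) - F_i(i_k, z_prev)
            z_prev, z = z, z - gamma * v
        z_tilde = z                           # restart point for the next epoch
    return z_tilde
```

By construction, only one full computation of F is needed per outer iteration; the parameter choice suggested by Theorem 1 below is $\gamma = 2/(9\ell)$ and $K = 10\ell/\mu$.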
Next, we analyse the convergence of this method. Note that we will use the vector
$v^K$ in the analysis, but in reality this vector is not calculated by the algorithm. Our
proofs are heavily based on the original work on SARAH [29]. Lemma 1 gives an
understanding of how $\|v^k\|^2$ behaves during the inner loop of Algorithm 1.
Lemma 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algorithm 1)
with $\gamma \le \frac{1}{\ell}$. Then, we have
$$\mathbb{E}\big[\|v^K\|^2\big] \le (1 - \gamma\mu)^K\, \mathbb{E}\big[\|F(z^0)\|^2\big].$$
Proof Using the recursive definition of $v^k$, for every inner iteration $k \ge 1$ we have
$$\|v^k\|^2 = \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 + 2\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), v^{k-1}\rangle.$$
Since $z^k - z^{k-1} = -\gamma v^{k-1}$,
$$\|v^k\|^2 = \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 - \frac{2}{\gamma}\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle$$
$$= \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 - \frac{1}{\gamma}\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle - \frac{1}{\gamma}\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle.$$
Taking the total expectation,
$$\mathbb{E}\big[\|v^k\|^2\big] = \mathbb{E}\big[\|v^{k-1}\|^2\big] + \mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle\big].$$
Using the tower property and $\mathbb{E}_{i_k}\big[F_{i_k}(z^k) - F_{i_k}(z^{k-1})\big] = F(z^k) - F(z^{k-1})$ in the last term,
$$\mathbb{E}\big[\|v^k\|^2\big] = \mathbb{E}\big[\|v^{k-1}\|^2\big] + \mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle \mathbb{E}_{i_k}[F_{i_k}(z^k) - F_{i_k}(z^{k-1})], z^k - z^{k-1}\rangle\big]$$
$$= \mathbb{E}\big[\|v^{k-1}\|^2\big] + \mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle\big] - \frac{1}{\gamma}\mathbb{E}\big[\langle F(z^k) - F(z^{k-1}), z^k - z^{k-1}\rangle\big].$$
Applying Assumption 1 (cocoercivity) to the third term and Assumption 2 (strong monotonicity) to the fourth term, and then using $z^k - z^{k-1} = -\gamma v^{k-1}$ once more, we obtain
$$\mathbb{E}\big[\|v^k\|^2\big] \le \mathbb{E}\big[\|v^{k-1}\|^2\big] + \mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big] - \frac{1}{\gamma\ell}\mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big] - \frac{\mu}{\gamma}\mathbb{E}\big[\|z^k - z^{k-1}\|^2\big]$$
$$= (1 - \gamma\mu)\,\mathbb{E}\big[\|v^{k-1}\|^2\big] + \frac{\gamma\ell - 1}{\gamma\ell}\,\mathbb{E}\big[\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2\big].$$
With $\gamma \le \frac{1}{\ell}$ the last term is non-positive, hence
$$\mathbb{E}\big[\|v^k\|^2\big] \le (1 - \gamma\mu)\,\mathbb{E}\big[\|v^{k-1}\|^2\big].$$
Running this recursion for $k = K, K-1, \ldots, 1$ and using $v^0 = F(z^0)$ completes the proof.
Lemma 2 Suppose that Assumption 1 holds. Consider SARAH (Algorithm 1). Then,
we have
$$\mathbb{E}\big[\|F(z^K) - v^K\|^2\big] \le \frac{\gamma\ell}{2 - \gamma\ell}\, \mathbb{E}\big[\|F(z^0)\|^2\big].$$
Proof Since $F(z^k) - v^k = [F(z^{k-1}) - v^{k-1}] + [F(z^k) - F(z^{k-1})] - [v^k - v^{k-1}]$, we have
$$\mathbb{E}\big[\|F(z^k) - v^k\|^2\big] = \mathbb{E}\big[\|F(z^{k-1}) - v^{k-1}\|^2\big] + \mathbb{E}\big[\|F(z^k) - F(z^{k-1})\|^2\big] + \mathbb{E}\big[\|v^k - v^{k-1}\|^2\big] + 2\mathbb{E}\big[\langle F(z^{k-1}) - v^{k-1}, F(z^k) - F(z^{k-1})\rangle\big] - 2\mathbb{E}\big[\langle F(z^{k-1}) - v^{k-1}, v^k - v^{k-1}\rangle\big] - 2\mathbb{E}\big[\langle F(z^k) - F(z^{k-1}), v^k - v^{k-1}\rangle\big].$$
Using the tower property and $\mathbb{E}_{i_k}\big[v^k - v^{k-1}\big] = F(z^k) - F(z^{k-1})$ in the last two terms,
$$\mathbb{E}\big[\|F(z^k) - v^k\|^2\big] = \mathbb{E}\big[\|F(z^{k-1}) - v^{k-1}\|^2\big] + \mathbb{E}\big[\|F(z^k) - F(z^{k-1})\|^2\big] + \mathbb{E}\big[\|v^k - v^{k-1}\|^2\big] + 2\mathbb{E}\big[\langle F(z^{k-1}) - v^{k-1}, F(z^k) - F(z^{k-1})\rangle\big] - 2\mathbb{E}\big[\langle F(z^{k-1}) - v^{k-1}, \mathbb{E}_{i_k}[v^k - v^{k-1}]\rangle\big] - 2\mathbb{E}\big[\langle F(z^k) - F(z^{k-1}), \mathbb{E}_{i_k}[v^k - v^{k-1}]\rangle\big]$$
$$= \mathbb{E}\big[\|F(z^{k-1}) - v^{k-1}\|^2\big] - \mathbb{E}\big[\|F(z^k) - F(z^{k-1})\|^2\big] + \mathbb{E}\big[\|v^k - v^{k-1}\|^2\big] \le \mathbb{E}\big[\|F(z^{k-1}) - v^{k-1}\|^2\big] + \mathbb{E}\big[\|v^k - v^{k-1}\|^2\big].$$
Summing over $k$ from 1 to $K$ and using $v^0 = F(z^0)$ gives
$$\mathbb{E}\big[\|F(z^K) - v^K\|^2\big] \le \sum_{k=1}^{K} \mathbb{E}\big[\|v^k - v^{k-1}\|^2\big]. \qquad (6)$$
Next, as in the proof of Lemma 1,
$$\|v^k\|^2 = \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 + 2\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), v^{k-1}\rangle = \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 - \frac{2}{\gamma}\langle F_{i_k}(z^k) - F_{i_k}(z^{k-1}), z^k - z^{k-1}\rangle$$
$$\le \|v^{k-1}\|^2 + \|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 - \frac{2}{\gamma\ell}\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 = \|v^{k-1}\|^2 + \frac{\gamma\ell - 2}{\gamma\ell}\,\|F_{i_k}(z^k) - F_{i_k}(z^{k-1})\|^2 = \|v^{k-1}\|^2 + \frac{\gamma\ell - 2}{\gamma\ell}\,\|v^k - v^{k-1}\|^2,$$
where we used Assumption 1 and the fact that $v^k - v^{k-1} = F_{i_k}(z^k) - F_{i_k}(z^{k-1})$. Rearranging and taking the expectation,
$$\mathbb{E}\big[\|v^k - v^{k-1}\|^2\big] \le \frac{\gamma\ell}{2 - \gamma\ell}\, \mathbb{E}\big[\|v^{k-1}\|^2 - \|v^k\|^2\big].$$
By substituting this into the expression (6) and using $v^0 = F(z^0)$, we finish the
proof.
Let us combine Lemmas 1 and 2 into the main theorem of this paper.
Theorem 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algorithm 1)
with $\gamma = \frac{2}{9\ell}$ and $K = \frac{10\ell}{\mu}$. Then, we have
$$\mathbb{E}\big[\|F(\tilde z^s)\|^2\big] \le \frac{1}{2}\,\mathbb{E}\big[\|F(\tilde z^{s-1})\|^2\big].$$
Proof We start from $\|F(z^K)\|^2 \le 2\|F(z^K) - v^K\|^2 + 2\|v^K\|^2$. Taking the expectation and applying Lemmas 2 and 1 gives
$$\mathbb{E}\big[\|F(z^K)\|^2\big] \le \left(\frac{2\gamma\ell}{2 - \gamma\ell} + 2(1 - \gamma\mu)^K\right)\mathbb{E}\big[\|F(z^0)\|^2\big].$$
With $\gamma = \frac{2}{9\ell}$ the first term in the brackets equals $\frac{1}{4}$, and with $K = \frac{10\ell}{\mu}$ the second one does not exceed $\frac{1}{4}$, hence
$$\mathbb{E}\big[\|F(z^K)\|^2\big] \le \frac{1}{2}\,\mathbb{E}\big[\|F(z^0)\|^2\big].$$
Since each outer iteration of Algorithm 1 starts from $z^0 = \tilde z^{s-1}$ and returns $\tilde z^s = z^K$, this is exactly
$$\mathbb{E}\big[\|F(\tilde z^s)\|^2\big] \le \frac{1}{2}\,\mathbb{E}\big[\|F(\tilde z^{s-1})\|^2\big].$$
Since we need to find a point z such that $F(z) \approx F(z^*) = 0$, we can easily get
an estimate on the oracle complexity (number of $F_i$ calls) to achieve precision $\varepsilon$.
Corollary 1 Suppose that Assumptions 1 and 2 hold. Consider SARAH (Algorithm 1)
with $\gamma = \frac{2}{9\ell}$ and $K = \frac{10\ell}{\mu}$. Then, to achieve an $\varepsilon$-solution ($\mathbb{E}\big[\|F(\tilde z^S)\|^2\big] \sim \varepsilon^2$),
we need
$$\mathcal{O}\!\left(\left(n + \frac{\ell}{\mu}\right)\log_2 \frac{\|F(z^0)\|^2}{\varepsilon^2}\right) \text{ oracle calls.}$$
At each outer iteration we compute the full operator once, and at each of the remaining
$K - 1$ iterations we call a single operator $F_i$ two times per inner iteration.
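For completeness, here is a short back-of-the-envelope count of how this bound arises under the parameter choice of Corollary 1 (it only restates the two facts above):
$$\text{oracle calls per outer iteration} = n + 2(K - 1) = n + 2\left(\frac{10\ell}{\mu} - 1\right) = \mathcal{O}\!\left(n + \frac{\ell}{\mu}\right), \qquad \text{number of outer iterations} = \mathcal{O}\!\left(\log_2 \frac{\|F(z^0)\|^2}{\varepsilon^2}\right),$$
since, by Theorem 1, each outer iteration halves $\mathbb{E}\big[\|F(\tilde z^s)\|^2\big]$; multiplying the two quantities gives the bound of Corollary 1.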
Note that the obtained oracle complexity coincides with the similar complexity
for SVRG from [6]. It is interesting to see how these methods behave in practice.
4 Experiments
The aim of our experiments is to compare the performance of different methods for
stochastic finite-sum cocoercive variational inequalities. In particular, we use SGD
from [26], SVRG from [6] and SARAH. We conduct our experiments on a finite-
sum bilinear saddle point problem:
$$g(x, y) = \frac{1}{n}\sum_{i=1}^{n} g_i(x, y), \qquad g_i(x, y) = x^{\top} A_i y + a_i^{\top} x + b_i^{\top} y + \frac{\lambda}{2}\|x\|^2 - \frac{\lambda}{2}\|y\|^2, \qquad (7)$$
Fig. 1 Bilinear problem (7): Comparison of state-of-the-art SGD-based methods for stochastic
cocoercive VIs. (a) Small $\ell$. (b) Medium $\ell$. (c) Large $\ell$
We run three experiment setups: with small $\ell \approx 10^2$, medium $\ell \approx 10^3$
and large $\ell \approx 10^4$.
See Fig. 1 for the results. We see that SARAH converges better than SVRG, and
SGD converges much slower.
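To illustrate the setup, here is a minimal sketch of the operator $F_i$ induced by (7) and of running the SARAH-type routine sketched earlier on it; the matrices $A_i$ and vectors $a_i$, $b_i$ are random placeholders, and the step size is arbitrary (the paper's exact data generation and tuning are not reproduced here).

```python
import numpy as np

d, n, lam = 10, 100, 0.1
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d, d))            # placeholder data A_i, a_i, b_i for (7)
a = rng.standard_normal((n, d))
b = rng.standard_normal((n, d))

def F_i(i, z):
    """Operator of the i-th summand of the bilinear problem (7), z = (x, y)."""
    x, y = z[:d], z[d:]
    gx = A[i] @ y + a[i] + lam * x            # d/dx g_i(x, y)
    gy = A[i].T @ x + b[i] - lam * y          # d/dy g_i(x, y)
    return np.concatenate([gx, -gy])          # F_i(z) = [grad_x g_i, -grad_y g_i]

z_out = sarah_vi(F_i, n, np.zeros(2 * d), gamma=1e-3, K=200, S=20)
residual = np.linalg.norm(np.mean([F_i(i, z_out) for i in range(n)], axis=0))
print("||F(z)|| after the run:", residual)
```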
Acknowledgments The authors would like to congratulate Boris Mirkin on his jubilee and wish
him good health and new scientific advances.
The work of A. Beznosikov was supported by the strategic academic leadership program ‘Pri-
ority 2030’ (Agreement 075-02-2021-1316 30.09.2021). The work of A. Gasnikov was supported
by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye), No.
075-00337-20-03, project No. 0714-2020-0005.
References
1. Alacaoglu, A., Malitsky, Y.: Stochastic variance reduction for variational inequality methods.
Preprint. arXiv:2102.08352 (2021)
2. Alacaoglu, A., Malitsky, Y., Cevher, V.: Forward-reflected-backward method with variance
reduction. Comput. Optim. Appl. 80 (2021). https://doi.org/10.1007/s10589-021-00305-3
3. Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex
objectives. In: International Conference on Machine Learning, pp. 1080–1089. PMLR (2016)
4. Beznosikov, A., Samokhin, V., Gasnikov, A.: Distributed saddle-point problems: lower bounds,
optimal and robust algorithms. Preprint. arXiv:2010.13112 (2020)
5. Beznosikov, A., Gasnikov, A., Zainulina, K., Maslovskiy, A., Pasechnyuk, D.: A unified
analysis of variational inequality methods: variance reduction, sampling, quantization and
coordinate descent. Preprint. arXiv:2201.12206 (2022)
6. Beznosikov, A., Gorbunov, E., Berard, H., Loizou, N.: Stochastic gradient descent-ascent:
unified theory and new efficient methods. Preprint. arXiv:2202.07262 (2022)
7. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning.
Siam Rev. 60(2), 223–311 (2018)
8. Chavdarova, T., Gidel, G., Fleuret, F., Lacoste-Julien, S.: Reducing noise in gan training with
variance reduced extragradient. 32, 393–403 (2019)
9. Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. Preprint.
arXiv:1905.10018 (2019)
10. Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: a fast incremental gradient method with support
for non-strongly convex composite objectives. In: Advances in Neural Information Processing
Systems, pp. 1646–1654 (2014)
11. Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via
stochastic path integrated differential estimator. Preprint. arXiv:1807.01695 (2018)
12. Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality
perspective on generative adversarial networks. Preprint. arXiv:1802.10551 (2018)
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
14. Gorbunov, E., Hanzely, F., Richtárik, P.: A unified theory of SGD: Variance reduction,
sampling, quantization and coordinate descent. In: International Conference on Artificial
Intelligence and Statistics, pp. 680–690. PMLR (2020)
15. Gorbunov, E., Berard, H., Gidel, G., Loizou, N.: Stochastic extragradient: general analysis and
improved rates. In: International Conference on Artificial Intelligence and Statistics, pp. 7865–
7901. PMLR (2022)
16. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: On the convergence of single-call
stochastic extra-gradient methods. In: Advances in Neural Information Processing Systems,
vol. 32, pp. 6938–6948. Curran Associates, Inc. (2019)
17. Hsieh, Y.G., Iutzeler, F., Malick, J., Mertikopoulos, P.: Explore aggressively, update conser-
vatively: Stochastic extragradient methods with variable stepsize scaling. 33, 16223–16234
(2020)
18. Hu, W., Li, C.J., Lian, X., Liu, J., Yuan, H.: Efficient smooth non-convex stochastic
compositional optimization via stochastic recursive gradient descent. In: Advances in Neural
Information Processing Systems. vol. 32, pp. 6929–6937. Curran Associates, Inc. (2019)
19. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance
reduction. In: Advances in Neural Information Processing Systems, vol. 26, pp. 315–323
(2013)
20. Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic
mirror-prox algorithm. Stoch. Syst. 1(1), 17–58 (2011)
21. Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. Preprint.
arXiv:2002.03329 (2020)
22. Kovalev, D., Beznosikov, A., Borodich, E., Gasnikov, A., Scutari, G.: Optimal gradient sliding
and its application to distributed optimization under similarity. Preprint. arXiv:2205.15136
(2022)
23. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
24. Li, Z., Bao, H., Zhang, X., Richtárik, P.: Page: a simple and optimal probabilistic gradient
estimator for nonconvex optimization. In: International Conference on Machine Learning, pp.
6286–6295. PMLR (2021)
25. Liu, X., Cheng, H., He, P., Chen, W., Wang, Y., Poon, H., Gao, J.: Adversarial training for large
neural language models. Preprint. arXiv:2004.08994 (2020)
26. Loizou, N., Berard, H., Gidel, G., Mitliagkas, I., Lacoste-Julien, S.: Stochastic gradient
descent-ascent and consensus optimization for smooth games: convergence analysis under
expected co-coercivity. In: Advances in Neural Information Processing Systems, vol. 34, pp.
19095–19108 (2021)
27. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
28. Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., Malitsky, Y.: Revisiting stochastic
extragradient. In: International Conference on Artificial Intelligence and Statistics, pp. 4573–
4582. PMLR (2020)
29. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine
learning problems using stochastic recursive gradient. In: International Conference on Machine
Learning, pp. 2613–2621. PMLR (2017)
30. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: Stochastic recursive gradient algorithm for
nonconvex optimization. Preprint. arXiv:1705.07261 (2017)
31. Nguyen, L.M., Nguyen, P.H., Richtárik, P., Scheinberg, K., Takác, M., van Dijk, M.: New
convergence aspects of stochastic gradient algorithms. J. Mach. Learn. Res. 20(176), 1–49
(2019), http://jmlr.org/papers/v20/18-759.html
32. Nguyen, L.M., Scheinberg, K., Takáč, M.: Inexact SARAH algorithm for stochastic optimiza-
tion. Optim. Methods Softw. 36(1), 237–258 (2021)
33. Palaniappan, B., Bach, F.: Stochastic variance reduction methods for saddle-point problems.
In: Advances in Neural Information Processing Systems, pp. 1416–1424 (2016)
34. Qian, X., Qu, Z., Richtárik, P.: Saga with arbitrary sampling. In: International Conference on
Machine Learning, pp. 5190–5199. PMLR (2019)
35. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407
(1951)
36. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average
gradient. Math. Program. 162(1–2), 83–112 (2017)
37. Shalev-Shwartz, S., Ben-David, S.: Understanding machine learning: from theory to algo-
rithms. Cambridge University Press (2014)
38. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient
solver for SVM. Math. Program. 127(1), 3–30 (2011)
39. Yang, Z., Chen, Z., Wang, C.: Accelerating mini-batch SARAH by step size rules. Inf. Sci.
558, 157–173 (2021)
40. Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T., Liu, J.: Freelb: enhanced adversarial training
for natural language understanding. Preprint. arXiv:1909.11764 (2019)
Dimensionality Reduction Using
Pseudo-Boolean Polynomials for Cluster
Analysis
1 Introduction
T. M. Chikake ()
Department of Discrete Mathematics, Moscow Institute of Physics and Technology, Moscow,
Russian Federation
e-mail: [email protected]
B. Goldengorin
Department of Mathematics, New Uzbekistan University, Tashkent, Uzbekistan
The Scientific and Educational Mathematical Center «Sofia Kovalevskaya Northwestern Center
for Mathematical Research», Pskov State University, Pskov, Russia
Department of Discrete Mathematics, Phystech School of Applied Mathematics and Informatics,
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow region, Russia
e-mail: [email protected]; [email protected]
datasets. Automated systems like artificial neural networks can be used in the feature
selection processes [3], but often require large amounts of data and challenges like
feature superposition [4], underfitting, overfitting, and interpretation concerns arise
[5].
In fields where large numbers of observations and/or large numbers of variables
exist such as signal processing, computer vision, speech recognition, neuroinfor-
matics, and bioinformatics, usage of dimension reduction techniques is crucial
[6]. Dimensionality reduction simplifies cluster analysis tasks for both human and
machine processors [6].
In this work, we observe that our powerful dimensionality reduction method can
assist in reducing the abuse of statistical methods and/or artificial neural networks
in tasks that can be solved in combinatorial steps. We show this by qualifying
our dimensionality reduction method, followed by linear clustering of reduced
samples. This advantage is available because our reduction method has invariability
properties, and it is intuitively easy to interpret. The problems of invariability and
interpretability are among the top problems of currently available dimensionality
reduction methods [7]. Dimensionality reduction tools that lack interpretability
and/or invariability can be disfavoured in critical tasks such as clustering models
that input medical tests/measurements to predict a diagnosis. Our work is directed
to solving such challenges.
Cluster analysis or clustering is the task of grouping a set of objects in such
a way that objects in the same group (called a cluster) are more similar (in some
sense) to each other than to those in other clusters [8]. Clustering is a major task of
exploratory data analysis, and a common technique for statistical data analysis, used
in many fields, including pattern recognition, image analysis, information retrieval,
bioinformatics, data compression, computer graphics and machine learning [8].
Abuse of statistical methods and/or artificial neural networks often arises in cases
where data is minimal. This abuse may result in overfitted, irreproducible or hard-
to-interpret solutions, which may be undesirable in some critical tasks.
The invariant manipulation based on the formulation of pseudo-Boolean polyno-
mials presented in this work can enable cluster analysts to extract simple rules
of association in data without the need for machine learning. The formulation of
pseudo-Boolean polynomials is very simple, computationally efficient, invariant to
ordering, and easy to explain and reproduce.
Our overall contributions in this paper are:
1. Qualifying the usage of the reduction property of penalty-based pseudo-Boolean
polynomials formulation for dimensionality reduction of multidimensional data
where it is feasible.
2. Reducing overdependence on data-driven approaches in solving problems that
can be solved with combinatorial steps.
Dimensionality reduction using the pseudo-Boolean polynomial formulation
revolves around the manipulation of the reduction and equivalence properties
of penalty-based pseudo-Boolean polynomials [9].
2 Related Work
3 Methods
cost
$$f_c(S) = \sum_{j \in J} \min\{c_{ij} \mid i \in S\}, \qquad (1)$$
with the assumption that entries of .C are non-negative and finite [20].
According to AlBdaiwi et al. [20], the objective function .fc (S) of this kind
of problem can be formulated in terms of pseudo-Boolean polynomials and from
where .p is the output dimension size, which we however choose to be .|J | such that
no measurement is lost.
By processing our samples into pseudo-Boolean polynomials, we obtain sample
representations with the least possible information cost needed to represent them.
4 Experimental Setup
Table 1 shows the Iris Flower dataset [11], a classical and popular dataset in the
machine learning community. The task on this dataset is to classify Iris plants into
three species (Iris setosa, Iris versicolor, and Iris virginica) using the lengths and
widths measurements of their petals and sepals. The dataset contains 150 samples.
Our method requires that data is structured as matrices, where columns are
measurements and rows are the features measured. For this dimensionality reduction
task, we first reshape the structure of all instances from .1 × 4 sized instances to
.2 × 2 sized instances, where rows represent the measurement type (Sepal, Petal)
and the columns represent the measurements (Length and Width) of the features
collected from 3 different species:
One cluster in the Iris Flower dataset, Iris-Setosa is linearly separable from others
while Iris-virginica and Iris-versicolour are not linearly separable between each
other as shown in Figs. 1 and 2.
Figure 1 shows the scatter plot on lengths and widths of sepal measurements
of the samples while Fig. 2 shows the scatter plot on lengths and widths of petal
measurements of the samples.
Applying the pseudo-Boolean polynomial reduction program, we reduce all
samples to 2 × 1 matrices. After combining like terms and dropping zero columns,
the pseudo-Boolean polynomial of the instance in Table 2 reduces to
$$\begin{pmatrix} 6.0 \\ 2.4\,y_2 \end{pmatrix}.$$
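To make the reduction step reproducible, here is a small Python sketch of the penalty-based pseudo-Boolean polynomial construction in the spirit of [20]: every column of a sample is sorted, successive differences become coefficients of monomials over the y-variables of the preceding rows, and like terms are combined across columns. The helper name pbp_reduce is an illustrative assumption rather than the authors' code; the example instance is the one that reappears in the discussion of the clustering results below.

```python
import numpy as np
from collections import defaultdict

def pbp_reduce(C):
    """Aggregate a sample matrix C (rows = measured feature types, columns = measurements)
    into a pseudo-Boolean polynomial, returned as {monomial (tuple of row indices): coefficient}."""
    C = np.asarray(C, dtype=float)
    poly = defaultdict(float)
    for j in range(C.shape[1]):
        order = np.argsort(C[:, j], kind="stable")   # permutation sorting column j increasingly
        prev = 0.0
        for k, row in enumerate(order):
            delta = C[row, j] - prev                 # successive difference of the sorted column
            prev = C[row, j]
            monomial = tuple(sorted(order[:k]))      # product y_{pi_1} * ... * y_{pi_{k-1}}
            poly[monomial] += delta                  # combine like terms across columns
    return {m: round(c, 6) for m, c in poly.items() if abs(c) > 1e-9}

# Rows (Sepal, Petal), columns (Length, Width); the 0-based row index 1 corresponds to y_2.
sample = [[5.1, 3.7],
          [1.5, 0.4]]
print(pbp_reduce(sample))     # {(): 1.9, (1,): 6.9}, i.e. the polynomial 1.9 + 6.9*y_2
```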
The Wisconsin Diagnostic Breast Cancer (WDBC) [10] dataset, is another classical
and popular dataset in the machine learning community. The dataset has 569
instances of 30 real-valued features that describe characteristics of the cell nuclei
present in digitized images extracted by a fine needle aspirate (FNA) on a breast
mass [22].
Ten real-valued features were computed for each cell nucleus:
1. radius (mean of distances from centre to points on the perimeter)
2. texture (standard deviation of greyscale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter² / area − 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension (“coastline approximation” − 1)
The mean, standard error (se), and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30 features
[22].
The task on this dataset is to predict whether a breast cancer diagnosis is benign
or malignant based on these features.
The best known predictive accuracy (97.5%) was obtained by using a separating
plane in the 3-D space of Worst Area, Worst Smoothness and Mean Texture features
using repeated tenfold cross-validations [22], and this classifier has correctly
diagnosed 176 consecutive new patients as of November 1995 [10].
The separating plane was obtained using Multisurface Method-Tree (MSM-T)
[23], a classification method which uses linear programming to construct a decision
tree where relevant features are selected using an exhaustive search in the space of
1–4 features and 1–3 separating planes [23].
Samples of size 3 × 10 result in 30-dimensional representations that are hard
to visualise on a Cartesian plot, and consequently it is hard for the analyst to identify the
features that are useful in making an accurate diagnosis.
To qualify our dimensionality reduction tool, we input each sample as a
3 × n, n ∈ {1, 2, 3, . . . , 10}, matrix whose rows J represent the measurement type
(mean, standard error (se), worst) and whose columns I are the measured features.
Applying the pseudo-Boolean polynomial formulation, we reduce all samples to
3 × 1 matrices that can be plotted and classified in a Cartesian space by a simple
boundary plane. Since our formulation runs in polynomial time, we can iterate over
all possible arrangements of measured features and discover the set of measured
features with the best-fitting plane separating samples into benign and
malignant, as sketched below.
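As an illustration of this exhaustive search, the sketch below scores feature subsets by how well a linear classifier separates the reduced coordinates; it reuses the pbp_reduce helper sketched earlier, and both the degree-wise aggregation of coefficients into three coordinates and the use of a linear SVM with cross-validation are assumptions made for illustration, not the authors' exact procedure.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def pbp_coordinates(sample_3xn):
    """Collapse the pseudo-Boolean polynomial of a 3 x n sample into three numbers
    by summing coefficients per monomial degree (degrees 0, 1, 2 with three rows)."""
    coords = np.zeros(3)
    for monomial, coef in pbp_reduce(sample_3xn).items():
        coords[len(monomial)] += coef
    return coords

def best_feature_subset(X, labels, feature_names, size=8):
    """Exhaustively score subsets of measured features; X has shape
    (n_samples, 3, 10) with rows (mean, se, worst) and columns the ten features."""
    best_acc, best_subset = 0.0, None
    for subset in combinations(range(X.shape[2]), size):
        Z = np.array([pbp_coordinates(sample[:, subset]) for sample in X])
        acc = cross_val_score(LinearSVC(dual=False), Z, labels, cv=5).mean()
        if acc > best_acc:
            best_acc, best_subset = acc, subset
    return best_acc, [feature_names[i] for i in best_subset]
```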
and
$$\begin{pmatrix} 5.1 & 3.7 \\ 1.5 & 0.4 \end{pmatrix}$$
all reduce to
$$\begin{pmatrix} 1.9 \\ 6.9\,y_2 \end{pmatrix},$$
and, in perfect confirmation of the equivalence condition, the instances also lie in
the same cluster.
Looking at the single outlier,
$$\begin{pmatrix} 5.1 & 3.7 \\ 1.5 & 0.4 \end{pmatrix},$$
its placement appears reasonable, as there are samples of Iris-virginica that differ very little in measured
values from this outlier, while its values are also significantly distinct from the other
Iris-versicolor samples. The incorrectly clustered instance might be attributed to
misclassification by the persons who labelled the dataset, or perhaps it is just a naturally
occurring outlier.
This finding exposes the overfitting concerns that arise from learning-based
methods, such as the Support Vector Machine (SVM) and the X-Boosted artificial
neural network of [21], which would report 100% accuracy in some tests. Additionally,
changing the train/test split results in varied accuracies when these methods are
used, thereby losing the invariance and reproducibility attributes that our method
guarantees.
Plotting the coefficients of the reduced instances just as in the previous dataset,
allows us to visualize properly the possible clusters from the data.
The reduction correctly represents the original instances with less information
cost, and allows us to discover a combination of features with the best plane that
separates the respective clusters.
Of all the combinations of features explored, the combination with the best
separating plane (95.4% accuracy) consisted of
1. radius
2. texture
3. perimeter
4. smoothness
5. compactness
6. concavity
7. symmetry
8. fractal dimension
excluding area and concave points features (Fig. 4).
Instances with the shortlisted features are separable by a plane z = 0.85x − 2y − 0.4
at an accuracy of 95.4%.
Although falling short of the accuracy (97.5%) reported in [10], our method
manages to demonstrate the dimensionality reduction capacity of a simple and invariant
approach that is solely based on the manipulation of orderings.
As shown in the experimental results reported in this paper, our method can be
used for unsupervised clustering of multidimensional data as well as in feature
selection processes in cluster analysis. Our method allows us to better understand
how multidimensional features contribute to the classification of
samples in an invariant and explainable manner and, in some cases, to achieve unbiased
and unsupervised clustering in cluster analysis processes.
6 Conclusion
Acknowledgments Tendai Mapungwana Chikake and Boris Goldengorin’s research was sup-
ported by Russian Science Foundation project No. 21-71-30005.
Boris Goldengorin acknowledges Scientific and Educational Mathematical Center “Sofia
Kovalevskaya Northwestern Center for Mathematical Research” for financial support of the present
study (agreement No 075-02-2023-937, 16.02.2023).
References
1. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a Kernel eigenvalue
problem. Neural Comput. 10(5), 1299–1319 (1998)
2. Johnstone, I., Titterington, D.: Statistical challenges of high-dimensional data. Philos. Trans.
A Math. Phys. Eng. Sci. 367, 4237–53 (2009)
3. Notley, S., Magdon-Ismail, M.: Examining the use of neural networks for feature extraction:
a comparative analysis using deep learning, support vector machines, and K-nearest neighbor
classifiers. arXiv preprint arXiv:1805.02294 (2018)
4. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds,
Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D.,
Wattenberg, M., Olah, C.: Toy Models of Superposition. arXiv e-prints arXiv:2209.10652
(2022). https://doi.org/10.48550/arXiv.2209.10652
5. Burnham, K.P., Anderson, D.R., Burnham, K.P.: Model Selection and Multimodel Inference:
A Practical Information-Theoretic Approach, 2nd edn. Springer, New York (2002)
6. Kopp, W., Akalin, A., Ohler, U.: Simultaneous dimensionality reduction and integration for
single-cell ATAC-seq data using deep learning. Nat. Mach. Intell. 4(2), 162–168 (2022)
7. Sarveniazi, A.: An actual survey of dimensionality reduction. Am. J. Comput. Math. 04(02),
55–72 (2014)
8. Sinharay, S.: An overview of statistics in education. In: Peterson, P., Baker, E., McGaw, B.
(eds.) International Encyclopedia of Education, 3rd edn., pp. 1–11. Elsevier, Oxford (2010)
9. Goldengorin, B., Krushinsky, D., Pardalos, P.M.: Cell Formation in Industrial Engineering,
Springer Optimization and Its Applications, vol. 79. Springer, New York (2013)
10. Street, W.N., Wolberg, W.H., Mangasarian, O.L.: Nuclear feature extraction for breast tumor
diagnosis. In: Acharya, R.S., Goldgof, D.B. (eds.) Biomedical Image Processing and Biomed-
ical Visualization, vol. 1905, pp. 861–870. International Society for Optics; Photonics; SPIE
(1993)
11. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7(2),
179–188 (1936)
12. Maaten, L.J.P.v.d., Hinton, G.E.: Visualizing high-dimensional data using t-SNE. J. Mach.
Learn. Res. 9(Nov), 2579–2605 (2008)
13. Jolliffe, I.: Principal Component Analysis. Springer (2002)
14. Jiang, H., Eskridge, K.M.: Bias in principal components analysis due to correlated observa-
tions. In: Conference on Applied Statistics in Agriculture (2000)
15. Bengio, Y., Monperrus, M., Larochelle, H.: Nonlocal estimation of manifold structure. Neural
Comput. 18(10), 2509–2528 (2006)
16. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in
Probability and Mathematical Statistics. Wiley, New York (1992)
17. Singh, S., Silakari, S.: Generalized discriminant analysis algorithm for feature reduction in
cyber-attack detection system. Int. J. Comput. Sci. Inform. Secur. 6(1), 173–180 (2009)
18. McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection
for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
19. Boros, E., Hammer, P.L.: Pseudo-boolean optimization. Discrete Appl. Math. 123(1–3), 155–
225 (2002)
20. AlBdaiwi, B.F., Ghosh, D., Goldengorin, B.: Data aggregation for p-median problems. J.
Comb. Optim. 21(3), 348–363 (2011)
21. Sarkar, T.: Xbnet: An extremely boosted neural network. Intell. Syst. Appl. 15, 200097
(2022). https://doi.org/10.1016/j.iswa.2022.200097. https://www.sciencedirect.com/science/
article/pii/S2667305322000370
22. Wolberg, W.H., Mangasarian, O.L.: Breast Cancer Wisconsin (Diagnostic) (1993). https://doi.org/10.24432/C5DW2B. https://archive.ics.uci.edu/dataset/17
23. Bennett, K.P.: Decision tree construction via linear programming. Computer Sciences Techni-
cal Report # 1067, University of Wisconsin, 14 pp., January (1992)
Pseudo-Boolean Polynomials Approach to
Edge Detection and Image Segmentation
1 Introduction
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects [1]. The process is closely related to blob extraction, a specific application of image processing techniques whose purpose is to isolate one or more objects (regions) in an input image [2].
The goal of segmentation is to simplify and/or change the representation of an
image into something that is more meaningful and easier to analyse [1].
There exist classical and AI-based methods for segmentation/blob extraction. These methods can be grouped into semantic [3], instance [4] and panoptic [5] segmentation. The underlying techniques can be classified into threshold filtering, clustering, differential motion subtraction, histogram, partial differential equation solving, graph partitioning, supervised neural-network association and edge detection methods. Our proposed method lies at the intersection of edge detection and threshold filtering methods.
T. M. Chikake · A. Samosyuk
Department of Discrete Mathematics, Moscow Institute of Physics and Technology, Moscow,
Russian Federation
e-mail: [email protected]; [email protected]
B. Goldengorin ()
Department of Mathematics, New Uzbekistan University, Tashkent, Uzbekistan
The Scientific and Educational Mathematical Center «Sofia Kovalevskaya Northwestern Center
for Mathematical Research», Pskov State University, Pskov, Russia
Department of Discrete Mathematics, Phystech School of Applied Mathematics and Informatics,
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow region, Russia
e-mail: [email protected]; [email protected]
2 Related Work
3 Methods
After the generation of these matrices, most of the processing steps (local aggregation, reduction of columns, and p-truncation) depend heavily on the terms matrix.
Using an example instance to illustrate the formulation process, we take a .4 × 4 patch from an image. Let

$$C = \begin{bmatrix} 8 & 8 & 8 & 5 \\ 12 & 7 & 5 & 7 \\ 18 & 2 & 3 & 1 \\ 5 & 18 & 9 & 8 \end{bmatrix}$$
The terms encoding function takes as input the permutation matrix .Π, which is an index ordering of the input matrix,

$$\Pi = \begin{bmatrix} 4 & 3 & 3 & 3 \\ 1 & 2 & 2 & 1 \\ 2 & 1 & 1 & 2 \\ 3 & 4 & 4 & 4 \end{bmatrix},$$

and produces the terms matrix

$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ y_4 & y_3 & y_3 & y_3 \\ y_1y_4 & y_2y_3 & y_2y_3 & y_1y_3 \\ y_1y_2y_4 & y_1y_2y_3 & y_1y_2y_3 & y_1y_2y_3 \end{bmatrix}.$$
We then perform local aggregation by summing similar terms and get a compact
representation of the initial instance as
$$\begin{bmatrix} 0 & 0 & 11 \\ 0 & 11y_3 & 3y_4 \\ 2y_1y_3 & 4y_2y_3 & 4y_1y_4 \\ 0 & 12y_1y_2y_3 & 6y_1y_2y_4 \end{bmatrix},$$
which in this particular example has 50% less cost compared to the initial instance
when expressed as polynomial.
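A minimal sketch of the formulation and local-aggregation steps on the example above (an illustrative reimplementation, not the code used in the paper); terms are represented as frozensets of y-indices, with the empty set standing for the constant term.

```python
import numpy as np
from collections import defaultdict

def pseudo_boolean_polynomial(C):
    """Map a cost matrix C to {term -> coefficient}, where a term is a frozenset of y-indices."""
    poly = defaultdict(float)
    m, n = C.shape
    for j in range(n):
        order = np.argsort(C[:, j], kind="stable")            # column j of the permutation matrix Pi
        diffs = np.diff(np.concatenate(([0.0], C[order, j])))  # differences of the sorted costs
        for k in range(m):
            term = frozenset(int(r) + 1 for r in order[:k])    # y-indices of the first k ordered rows
            poly[term] += diffs[k]                             # local aggregation: sum similar terms
    return dict(poly)

C = np.array([[8, 8, 8, 5],
              [12, 7, 5, 7],
              [18, 2, 3, 1],
              [5, 18, 9, 8]], dtype=float)
poly = pseudo_boolean_polynomial(C)
# e.g. poly[frozenset()] == 11.0 (constant term) and poly[frozenset({1, 2, 3})] == 12.0 (the 12*y1*y2*y3 term)
```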
We then search for the equivalent matrix with the minimum number of columns
and in this particular example, it is already reduced to this state.
When pseudo-Boolean polynomials are reduced to their smallest instance, an equivalence property becomes apparent: different instances carrying similar information converge to the same reduced pseudo-Boolean polynomial.
This property is central to the blob aggregation task; in this task, the regions of equivalence in the reduced pseudo-Boolean polynomial context constitute the set of blobs.
Below are examples of cost matrices, which are initially different but converge
into similar reduced instances.
$$\begin{bmatrix} 138 & 138 & 138 & 136 \\ 139 & 139 & 138 & 137 \\ 142 & 141 & 139 & 138 \\ 142 & 140 & 139 & 138 \end{bmatrix}$$

and

$$\begin{bmatrix} 136 & 136 & 138 & 140 \\ 138 & 137 & 138 & 140 \\ 140 & 139 & 140 & 141 \\ 139 & 139 & 140 & 141 \end{bmatrix}$$
reduce to
$$\begin{bmatrix} 550 \\ 3y_1 \\ 6y_1y_2 \\ 1y_1y_2y_4 \end{bmatrix}$$
alter how fine or coarse our edges should be. If the degree .r of the pseudo-Boolean polynomial calculated on the patch is .p or higher, then the patch lies over an edge; otherwise it lies in a blob. The edge/blob classifier is simply

$$f(r, p) = \begin{cases} \text{blob} & \text{if } r < p,\\ \text{edge} & \text{otherwise} \end{cases} \qquad (1)$$
Given an image, broken into small patches of size .4 × 4, the maximum possible
degree is .3, and we can select .p = 1, such that we have a binary set of patches,
where patches, whose pseudo-Boolean polynomials reduce to a constant are
described as blob regions, and the rest as edge points.
Using equivalence [6], we can fine-tune the classified edges: should neighbouring patches exhibit contrasting edge/blob classes, we re-label the blob in favour of the edge, or vice versa, depending on how fine we require our edges to be.
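A sketch of the patch classifier of Eq. (1), reusing the `pseudo_boolean_polynomial` helper from the earlier sketch; the patch size and threshold follow the example values in the text, but the function names are illustrative only.

```python
import numpy as np

def pbp_degree(patch):
    """Degree r of the reduced pseudo-Boolean polynomial of a patch (0 for a constant patch)."""
    poly = pseudo_boolean_polynomial(patch.astype(float))
    return max((len(term) for term, coeff in poly.items() if coeff != 0), default=0)

def classify_patches(image, patch=4, p=1):
    """Label each patch x patch tile of `image` as 'blob' (r < p) or 'edge' (otherwise)."""
    labels = {}
    for i in range(0, image.shape[0] - patch + 1, patch):
        for j in range(0, image.shape[1] - patch + 1, patch):
            r = pbp_degree(image[i:i + patch, j:j + patch])
            labels[(i, j)] = "blob" if r < p else "edge"
    return labels
```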
4 Experimental Setup
Given an image of size .200 × 200 containing basic primitive shapes of continuous
colour shown in Fig. 2a, we extract patches of significantly smaller sizes, e.g. .6 × 6
as shown in Fig. 2b.
We then apply the formulation and reduction of pseudo-Boolean polynomials on
each patch and group them into a binary set .S = {Blob, Edge} based on the pseudo-
Boolean polynomial degree.
Patches whose pseudo-Boolean polynomial degree .r < p are considered Blobs
and Edge otherwise.
Patches with constant pixel values .x ∈ [0, 255], such as

$$\begin{bmatrix} 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 \end{bmatrix},$$

reduce to the constant polynomial 396,
Fig. 2 Input image of primitives and its patching process. (a) Input image. (b) Showing .6 × 6
sized patches
while some patches with varied pixel values that cancel out, such as

$$\begin{bmatrix} 254 & 254 & 19 & 84 \\ 254 & 254 & 19 & 84 \\ 254 & 254 & 19 & 84 \\ 254 & 254 & 19 & 84 \end{bmatrix},$$

also reduce to a constant, here 611.
Consequently these patches are grouped among the set of patches whose informa-
tion cost is described as blob region.
There are patches which contain varied pixel data that converge to high-order pseudo-Boolean polynomials, such as

$$\begin{bmatrix} 254 & 254 & 6 & 17 \\ 254 & 254 & 6 & 17 \\ 254 & 254 & 6 & 17 \\ 254 & 254 & 6 & 123 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 254 & 254 & 6 & 17 \\ 254 & 254 & 6 & 123 \\ 254 & 254 & 6 & 123 \\ 254 & 254 & 6 & 123 \end{bmatrix},$$

which converge to reduced polynomials such as

$$\begin{bmatrix} 531 \\ 106y_1y_2y_3 \end{bmatrix}.$$
Such high-order polynomials indicate that the patch lies over a transition of spatial features. For this reason we adopted a Gaussian filter as a preprocessing step, followed by a pixel set aggregation step, which is essentially a multivalued threshold processing. Instead of raw pixels as input in our patches, we group ranges of pixel values into sets, whose size depends on the variance of the pixel distribution in the image, to promote group convergence of the pseudo-Boolean polynomials.
We create these sets of pixels as

$$f(x) = \begin{cases} 0 & \text{if } x < 5,\\ 1 & \text{if } x \in [5, 10),\\ 2 & \text{if } x \in [10, 15),\\ \;\vdots & \\ 51 & \text{if } x > 250, \end{cases} \qquad (2)$$
thereby reducing the information cost range from .[0, 255] to .[0, 51] for instance.
Natural images tend to have smooth transitions of pixel values between neighbouring pixels at the atomic level, due to anti-aliasing in the RGB representation of image data. This reduces the chances of neighbouring pixel patches converging into equivalent groups and consequently yields coarse edges, but the Gaussian filter preprocessing together with the grouping of pixel ranges promotes finer edges.
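A small illustrative sketch of this preprocessing chain (Gaussian smoothing followed by the pixel-range grouping of Eq. (2), approximated here by floor division); the sigma and bin width are example values, not prescribed settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess(image, sigma=1.0, bin_width=5):
    """Smooth the image, then group pixel values into ranges of width `bin_width`."""
    smoothed = gaussian_filter(image.astype(float), sigma=sigma)  # suppress anti-aliasing noise
    return (smoothed // bin_width).astype(int)                    # e.g. bin_width=5 maps [0, 255] to [0, 51]
```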
We apply our aggregation process on natural images and observe the need for the
Gaussian filter as well as the pixel set aggregation preprocessing to achieve useful
segmentation. Figure 4 shows a noisy natural input image that we pass through the segmentation process without preprocessors, as shown in Fig. 5, and with preprocessors, as shown in Fig. 6.
As can be observed in Fig. 6, the Gaussian preprocessing and pixel grouping steps allow us to achieve better edge and blob extraction. Pixel sets of size 40 limit our cost range from .[0, 255] to .[0, 7] and make contrasting regions more pronounced.
Additionally, we observe that a costly operation which aggregates the reduced pseudo-Boolean polynomials into equivalent groups finds numerous insignificantly small and unuseful groups in the setups that exclude the preprocessing steps, while larger equivalent groups can be aggregated in the setups that involve all the preprocessing steps.
We apply our method with selected processing parameters on the Dubai landscape
dataset [17]. In this experiment we show that our method can extract edges of land-
scape features on satellite imagery which can be used for semantic segmentation.
Humans in the Loop published an open access dataset annotated for a joint
project with the Mohammed Bin Rashid Space Center in Dubai, UAE, which
consists of aerial imagery of Dubai obtained by MBRSC satellites and annotated
with pixel-wise semantic segmentation in 6 classes [17].
The full solution on this dataset requires placing labels on each segmented region; however, our current method can only segment regions along boundary edges.
The distinctive advantages of our method over state-of-the-art neural-network-based solutions in instance segmentation are:
• blind segmentation (no learning is involved, which can be prone to overfitting/underfitting issues)
• faster and CPU-friendly segmentation
• explainable mathematical steps to segmentation.
Our method is also not limited to a given dataset, since it works in a blind manner which does not require prior familiarity with related image data, except for choosing the threshold ranges. Provided with images that contain contrasting features, we can guarantee that our method will propose segmentations of features in purely mathematical and deterministic steps.
Figure 7 shows an example of our method applied to an image sample from the Dubai Landscape dataset [17].
Our method highlights unbiased and fast computer vision in a segmentation task. The parameters required to achieve useful segmentation using our method are limited to:
1. the Gaussian filter kernel size,
2. the pixel thresholding size, and
3. the patch size.
The optimal choice of these parameters is the only limitation on the generalization of our proposed method, and in future work we propose an automated process for selecting these parameters.
The performance of the method depends linearly on the choice of the patch size parameter: the smaller the patch size, the finer the segmentation but the longer the processing, and vice versa.
6 Conclusion
Acknowledgments Tendai Mapungwana Chikake and Boris Goldengorin’s research was sup-
ported by Russian Science Foundation project No. 21-71-30005.
Boris Goldengorin acknowledges Scientific and Educational Mathematical Center “Sofia
Kovalevskaya Northwestern Center for Mathematical Research” for financial support of the present
study (agreement No 075-02-2023-937, 16.02.2023).
References
1. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, Upper Saddle River, NJ
(2001)
2. Yusuf, M.D., Kusumanto, R., Oktarina, Y., Dewi, T., Risma, P.: Blob analysis for fruit
recognition and detection. Comput. Eng. Appl. J. 7(1), 25–36 (2018)
3. Guo, D., Pei, Y., Zheng, K., Yu, H., Lu, Y., Wang, S.: Degraded image semantic segmentation
with dense-gram networks. IEEE Trans. Image Process. 29, 782–795 (2020)
4. Yi, J., Wu, P., Jiang, M., Huang, Q., Hoeppner, D.J., Metaxas, D.N.: Attentive neural cell
instance segmentation. Med. Image Anal. 55, 228–240 (2019)
5. Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation arXiv.org
(2018). https://arxiv.org/abs/1801.00868v3
6. AlBdaiwi, B.F., Ghosh, D., Goldengorin, B.: Data aggregation for p-median problems. J.
Comb. Optim. 21(3), 348–363 (2011)
7. Chennupati, S., Narayanan, V., Sistu, G., Yogamani, S., Rawashdeh, S.A.: Learn-
ing panoptic segmentation from instance contours (2021). http://arxiv.org/abs/2010.11681.
ArXiv:2010.11681 [cs]
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2018). https://doi.org/10.48550/
arXiv.1703.06870. http://arxiv.org/abs/1703.06870. ArXiv:1703.06870 [cs]
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image
segmentation. Tech. Rep. arXiv:1505.04597 (2015)
10. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell.
PAMI-8(6), 679–698 (1986)
11. Liu, K., Xiao, K., Xiong, H.: An image edge detection algorithm based on improved canny. In:
Proceedings of the 2017 5th International Conference on Machinery, Materials and Computing
Technology (ICMMCT 2017). Atlantis Press, Beijing, China (2017)
12. Kong, H., Akakin, H.C., Sarma, S.E.: A generalized Laplacian of Gaussian filter for blob
detection and its applications. IEEE Trans. Cybern. 43(6), 1719–1733 (2013)
13. Assirati, L., Silva, N.R., Berton, L., Lopes, A.A., Bruno, O.M.: Performing edge detection by
difference of Gaussians using q-Gaussian kernels. J. Phys. Conf. Ser. 490, 012020 (2014)
14. Li, S., Liu, C., Wang, Y. (eds.): Pattern Recognition: 6th Chinese Conference, CCPR 2014,
Changsha, China, November 17–19, 2014. Proceedings, Part I, Communications in Computer
and Information Science, vol. 483. Springer Berlin Heidelberg, Berlin (2014)
15. Boros, E., Hammer, P.L.: Pseudo-boolean optimization. Discrete Appl. Math. 123(1–3), 155–
225 (2002)
16. Goldengorin, B., Krushinsky, D., Pardalos, P.M.: Cell Formation in Industrial Engineering,
Springer Optimization and Its Applications, vol. 79. Springer New York, New York (2013)
17. Humans in the Loop: Semantic segmentation dataset (2020). https://humansintheloop.org/resources/datasets/semantic-segmentation-dataset-2/. Accessed 2022-12-03
Purifying Data by Machine Learning
with Certainty Levels
1 Introduction
S. Dolev
Ben-Gurion University of the Negev, Be’er Sheva City, Israel
e-mail: [email protected]
G. Leshem ()
Ashkelon Academic College, Ashkelon, Israel
e-mail: [email protected]
collaborative classification when results from several trees are combined to a final
classification.
Previous Work In the Probably Approximately Correct (PAC) learning frame-
work, Valiant [14] introduced the notion of PAC learning in the presence of
malicious noise. This is a worst-case model of errors in which some fraction
of the labeled examples given to a learning algorithm may be corrupted by an
adversary who can modify both example points and labels in an arbitrary fashion.
The frequency of such corrupted examples is known as the malicious noise rate.
This study assumed that there is a fixed probability β (0 < β < 1) of an error
occurring independently on each request, but the error is of an arbitrary nature. In
particular, the error may be chosen by an adversary with unbounded computational
resources and knowledge of the function being learned, the probability distribution
and the internal state of the learning algorithm. (Note that in the standard PAC model the learner has access to an oracle returning some labeled instance (x, C(x)) for each query, where C(x) is some fixed concept belonging to a given target class C and x is a randomly chosen sample drawn from a fixed distribution D over the domain X. Both C and D are unknown to the learner, and each randomly drawn x is independent of the outcomes of the other draws.)
In the malicious variant of the PAC model introduced by Kearns and Li [8], the
oracle is allowed to ‘flip a coin’ for each query with a fixed bias η for heads. If
the outcome is heads, the oracle returns some labeled instance (x, y) antagonistically chosen from X × {−1, +1}. If the outcome is tails, the oracle is forced to behave exactly as in the standard model, returning the correctly labeled instance (x, C(x)), where x ∼ D (x is a sample drawn from the distribution D).
In both the standard and malicious PAC models the learner's goal, for all inputs ε, δ > 0, is to output some hypothesis H ∈ H (where H is the learner's fixed hypothesis class) by querying the oracle at most m times, for some m = m(ε, δ) in the standard model and some m = m(ε, δ, η) in the malicious model. For all targets C ∈ C and distributions D, the hypothesis H of the learner must satisfy E_{x∼D}[H(x) ≠ C(x)] ≤ ε with probability at least 1 − δ with respect to the oracle's randomization. We will call ε and δ the accuracy and the confidence parameter, respectively. Kearns and Li [8] have also shown that for many classes of Boolean functions (concept classes), it is impossible to learn to accuracy ε if the malicious noise rate exceeds ε/(1 + ε). In fact, for many interesting concept classes, such as the class of linear threshold functions, the most efficient algorithms known can only tolerate malicious noise rates significantly lower than this general upper bound.
Despite these difficulties, the importance of being able to cope with noisy data
has led many researchers to study PAC learning in the presence of malicious noise
[1–3, 5, 6, 9, 13]. In Servedio [13], a PAC boosting algorithm is developed using
smooth distributions. This algorithm can tolerate low malicious noise rates but
requires access to a noise-tolerant weak learning algorithm of known accuracy. This
weak learner, L, which takes as input a finite sample S of m labeled examples,
has some tolerance to malicious noise; specifically, L is guaranteed to generate
a hypothesis with non-negligible advantage provided that the frequency of noisy
examples in its sample is at most 10% and that it has a high probability to learn with
high accuracy in the presence of malicious noise at a rate of 1%.
Our Contribution We present a verifiable way to cope with arbitrary faults
introduced by even the most sophisticated adversary, and show that the technique
withstands this malicious (called Byzantine) intervention so that even in the worst
case scenario the desired results of the machine learning algorithm can be achieved.
The assumption is that an unknown part of a data-set is Byzantine, namely,
introduced to mislead the machine learning algorithm as much as possible. Our goal
is to show that we can ignore/filter the influence of the misleading portions of the
malicious data-set and obtain meaningful (machine learning) results. In reality, the
Byzantine portion in the data-set may be introduced by a malfunctioning device with
no adversarial agenda, nevertheless, a technique proven to cope with the Byzantine
data items will also cope with less severe cases. In this paper, we develop three new
approaches for increasing the certainty level of the learning process, where the first
two approaches identify and/or filter data items that are suspected to be Byzantine
data items in the data-set (e.g., a training file). In the third approach we introduce
the use of the certainty level for combining machine learning techniques (similar to
the previous studies).
The first approach fits best the case in which the Byzantine data is added to the
data-set, and is based on the calculation of the statistical parameters of the data-set.
The second approach considers the case where part of the data is Byzantine, and
extends the use of the certainty level for those cases in which no concentrations
of outliers are identified. Data-sets often have several features (or attributes) which
are actually columns in the training and test files that are used for cross-checks
and better prediction of the outcome in both simple and sophisticated scenarios.
The third approach deals with cases in which the Byzantine data is part of the data and appears in two possible modes: where part of the data in a feature is Byzantine and/or where several features are entirely Byzantine. The third technique is based
on decision trees similar to the Random Forest algorithm [4]. After the decision
trees are created from the training data, each variable from the training data passes
through these decision trees, and whenever the variable arrives to a tree leaf, its tree
classification is compared with its class. When the classification and the class are
in agreement, a right variable of the leaf is incremented; otherwise, the value of a
wrong variable of this leaf is incremented. The final classification for every variable
will be determined according to the right and wrong values. This enhancement of the random forest is of independent interest, both conceptually and practically, improving the well-known random forest technique.
Road Map The rest of the paper is organized as follows: In the next section
(Sect. 2), we describe approaches for those cases in which Byzantine data items
are added to the data-set, and the ways to identify statistical parameters when the
distribution of a feature is known. In Sects. 3 and 4, we present those cases in
which the Byzantine adversary receives the data-set and chooses which items to
add/corrupt. Section 3 describes ways to cope with Byzantine data in the case of a
single feature with a classification of a given certainty level. Section 4 extends the
use of the certainty level to handle several features, extending and improving the
random forest techniques. The conclusion appears in Sect. 5.
We start with the cases in which Byzantine data is added to the data-set. Our goal
is to calculate the statistical parameters of the data-set, such as the distribution
parameters of the uncorrupted items in the data-set, despite the addition of the
Byzantine data. Consider the following examples, which drive the learning algorithm to
the wrong classification, where the raw data contains one feature (or attribute) of
the samples (1 vector) that obeys some distribution (e.g., normal distribution), plus
additional adversary data. The histogram that describes such an addition is presented
on the left side of Fig. 1, where the “clean” samples are inside the curve and the
addition of corrupted data is outside the curve (marked in blue). The corrupted
data items in these examples are defined as samples that cause miscalculation of
statistical parameters like .μ and .σ and as a result, the statistical variables are less
significant. Another case of misleading data added to the data-set, a special case to
the one above, is demonstrated on the right side of Fig. 1. The histogram of these
samples is marked in green, where the black vertical line that crosses the histogram
separates samples with labels .+1 and .−1. The labels of the misleading data are
inverted with relation to the labels of other data items with the same value. To
achieve our goal to calculate the most accurate statistical parameters for the feature’s
distribution in the sample population, we describe a general method to identify and
filter the histograms that may include a significant number of additional corrupted
data items.
Fig. 1 Histogram of original samples with additional corrupted data outside the normal curve but
in the bound of .μ ± 3σ (left), and outside the normal curve and outside the bound of .μ ± 3σ (right)
Method for Identifying Suspicious Data and Reducing the Influence of Byzan-
tine Data This first approach is based on the assumption that we can separate
“clean” data by a procedure based on the calculation of the .μ and .σ parameters
of the uncorrupted data. According to the central limit theorem, 30 data items
chosen uniformly, which we call a batch, can be used to define the .μ and .σ . Thus,
the first step is to try to find at least 30 clean samples (with no Byzantine data).
Note that according to the central limit theorem, the larger the set of samples,
the closer the distribution is to being normal, therefore, one may choose to select
more than 30 samples. We use n=30 as a cutoff point and assume that the sampling
distribution is approximately normal. In the presence of Byzantine data one should
try to ensure that the set of 30 samples will not include any Byzantine items.
This case is similar to the case of a shipment of N objects (real data) in which
m are defective (Byzantine). In probability theory and statistics, hypergeometric
distribution describes the probability that in a sample of n distinctive objects drawn
from the shipment, exactly k objects are defective. The probability of drawing exactly k Byzantine items is

$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} \qquad (1)$$

Note that for clean samples k = 0 and the equation becomes

$$P(X = 0) = \frac{\binom{N-m}{n}}{\binom{N}{n}} \qquad (2)$$
In order to prevent the influence of the adversary on the estimation of .μ and .σ (by addition of Byzantine data), we require that the probability in Eq. (2) be higher than 50% (P > 1/2). Additionally, according to the Chernoff bound we obtain a lower bound on the success probability of the majority of n independent choices of 30-sample batches (thus, with a small number of batch samplings we obtain a good estimation of the .μ and .σ parameters of clean batches). The ratio between N (all samples) and m (Byzantine samples) that implies a probability greater than 1/2 of sampling a clean batch is presented in Fig. 2.
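A numerical illustration of Eq. (2), using SciPy's hypergeometric distribution; the specific N and m values below are examples only.

```python
from scipy.stats import hypergeom

def prob_clean_batch(N, m, n=30):
    """P(X = 0): a uniformly drawn batch of n items contains no Byzantine items."""
    # population of N items, m of them Byzantine, batch size n
    return hypergeom(N, m, n).pmf(0)

# With N = 1000 and m = 20 (a 2% Byzantine ratio), a 30-item batch is clean
# with probability ~0.54 > 1/2, matching the ratio discussed above.
print(prob_clean_batch(1000, 20))
```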
As demonstrated in Fig. 2, the ratio between N (all samples) and m (Byzantine samples) is about 2% (e.g., 20 Byzantine samples for every 1000 samples); if this Byzantine ratio is found by the new method (as described below), the probability that any other column in the data-set contains a Byzantine sample is very low (in other words, the confidence that in every other column the samples are "clean" is high). Our goal is to sample a majority of "clean" batches to estimate statistical
high)). Our goal is to sample a majority of “clean” batches to estimate statistical
parameters such as .μ and .σ of the non-Byzantine samples in the data-set. The
estimation of these parameters will be done according to an Algorithm 1.
4. end for
5. On the assumption that the distribution of the original data is normal or approximately normal, the histogram of the estimated .μ and .σ is also approximately normal (according to the central limit theorem). The probability of choosing a "clean" batch is higher than 50%, therefore at least 50% of the estimations are clean. The value of .μ̂ (and .σ̂ ) will be chosen as the median of the .μ (and .σ ) values, thus ensuring that our choice has at least one clean batch with a higher (and one with a lower) .μ (and .σ , respectively).
Note: The Chernoff bound gives a lower bound on the success probability of majority agreement for B independent, equally likely events, and the number of trials is determined according to the following equation:

$$B \geq \frac{1}{2(P - 1/2)^2}\,\ln\frac{1}{\sqrt{\varepsilon}},$$

where the probability P > 1/2 and ε is the smallest probability that we can promise for an incorrect event (e.g., for the probability of a correct event at a confidence level of 95% or 99%, the probability of an incorrect event, ε, is 0.05 or 0.01, respectively).
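Assuming the reconstruction of the bound above, the sketch below (illustrative only) estimates the clean-data parameters by taking the medians of per-batch estimates, with the number of batches chosen from the Chernoff bound; P is the clean-batch probability and eps the allowed probability of an incorrect majority.

```python
import numpy as np

def estimate_clean_params(data, P=0.54, eps=0.05, batch=30, seed=None):
    """Median-of-batches estimate of mu and sigma in the presence of Byzantine items."""
    rng = np.random.default_rng(seed)
    B = int(np.ceil(np.log(1 / np.sqrt(eps)) / (2 * (P - 0.5) ** 2)))  # number of batches
    mus, sigmas = [], []
    for _ in range(B):
        sample = rng.choice(data, size=batch, replace=False)  # one uniformly drawn batch
        mus.append(sample.mean())
        sigmas.append(sample.std(ddof=1))
    # the medians are protected by the majority of clean batches
    return float(np.median(mus)), float(np.median(sigmas))
```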
and the variance. Based on CLT, we were able to efficiently obtain (using Chernoff
bound) the expected value and the variance of the data item values. Next, for every
given number of data items, and type of distribution graph, the parameters of the
graph that will respect these values (expected value, variance, distribution type,
and number of data items) can be found. In the sequel, we consider the case of
a distribution type of graph which reflects the normal distribution. The next stage
for identifying suspicious data items is based on analysis of the overflow of data
items beyond the distribution curve (Fig. 1). The statistical parameters which were
found in the previous stage are used in Algorithm 2, described below:
Algorithm 2: Description of the first technique for removing suspected data items.
The suspicious bins, those with a significant overflow, are marked and will not be
considered for the training process of the machine learning. The data-set after the
cleaning process contains values from bins (in the data histogram) without overflow
(e.g., the ratio between the integral of the normal curve to the data items in the same
bin is approximately 1).
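An illustrative sketch of this bin-overflow test (the bin count and overflow ratio below are arbitrary choices, not values from the paper):

```python
import numpy as np
from scipy.stats import norm

def suspicious_bins(data, mu, sigma, n_bins=30, ratio=1.5):
    """Mark histogram bins whose observed count exceeds the normal-curve expectation."""
    counts, edges = np.histogram(data, bins=n_bins)
    expected = len(data) * (norm.cdf(edges[1:], mu, sigma) - norm.cdf(edges[:-1], mu, sigma))
    overflow = counts > ratio * np.maximum(expected, 1e-9)
    return overflow, edges

def clean_mask(data, overflow, edges):
    """Keep only items that fall into bins without significant overflow."""
    bin_idx = np.clip(np.digitize(data, edges) - 1, 0, len(overflow) - 1)
    return ~overflow[bin_idx]
```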
Note that when the number of extra data items in the bins (which was counted
during the “cleaning” process) with overflow (data items outside the integral curve)
is higher than 2% of the whole data-set, we can assume that the other bins are clean.
The next section deals with the remaining uncertainty.
Experiments and Results We present results to demonstrate the qualities of our
new approach, and to reveal the differences between a data-set with and without
corrupted data in order to validate the proposed method experimentally. The
comparison between the data-sets was done with the C4.5 algorithm [12]. C4.5
uses a decision tree developed by Quinlan [11], and is an extension of Quinlan’s
earlier ID3 (Iterative Dichotomiser 3) algorithm [10]. The decision trees generated
by C4.5 can be used for classification, and for this reason C4.5 is often referred to
as a statistical classifier. For this comparison we used an available tool based on the
C4.5 file from “Classification Toolbox for Matlab”. The results appear in Table 1.
The data-set for the first experiment is an artificial data-set created by Matlab
software with the function “normrnd” (which generates random numbers from a
normal distribution with a mean parameter .μ and a standard deviation parameter
.σ ); this database consists of normally distributed data and contains 1 attribute (2296 samples) and 2 classes (.−1, .+1). Note that in the presence of Byzantine data items the resulting data-set represents a worst case with respect to the original data items and the learning algorithm. Here we deliberately introduced a corrupt
data-set that includes Byzantine data to demonstrate the benefits of the proposed
technique. For the second and third experiments, the previous database is extended
with corrupt data (such as the data-set shown in Fig. 1) that also has an inverted
label in relation to the label with the same value (2296 “clean” plus 45 “corrupted”,
2341 samples in total).
We continue considering the case where part of the data in the feature is corrupted.
Our goal in this section is to find the certainty level of every sample in the
distribution in the case where the upper bound on a number of corrupted data items
is known. This section is actually a continuation of the previous one, as both sections deal with a single feature: the first deals with an attempt to find an overflow of samples, and the second copes with cases where such attempts fail, either because the distribution is not known in advance or because no overflows are found.
The histogram of these samples is colored green, where the black vertical line that
crosses the histogram separates samples with labels .+1 and .−1. The labels of the
Byzantine data have an inverted label with relation to the label of the non-Byzantine
data items with the same value. To achieve our goal we describe a general method
that bounds the influence of the Byzantine data items.
Method to Bound the Influence of the Byzantine Data Items The new approach is based on the assumption that an upper bound .ξ on the number of Byzantine data items that may exist in every bin in the distribution is known (e.g., a maximum .ξ of 8 items). The certainty level .ζ of each bin is calculated by the following equations:

$$\zeta_{-1} = \frac{L_{-1} - \xi}{N} \qquad (3)$$

$$\zeta_{+1} = \frac{L_{+1} - \xi}{N} \qquad (4)$$
Where .L−1 is the number of data items that are labeled as .−1, .L+1 is the number
of data items that are labeled as .+1, and N is the number of data items in the bin.
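In code form, Eqs. (3) and (4) amount to the following trivial sketch (assuming N = L_{−1} + L_{+1} for the bin):

```python
def bin_certainty(l_minus, l_plus, xi):
    """Certainty levels (zeta_-1, zeta_+1) of a bin with label counts l_minus, l_plus and bound xi."""
    n = l_minus + l_plus
    return (l_minus - xi) / n, (l_plus - xi) / n
```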
Algorithm 3: Description of the method for finding the certainty level of every
sample for .ξ Byzantine data items in every bin in the distribution.
Our last contribution deals with the general cases in which corrupted data are part of
the data-set and can appear in two modes: (i) An entire feature is corrupted (Fig. 3),
and (ii) Part of the features in the data-set is corrupted and the other part is clean.
Note that there are several ways to corrupt an entire feature, including: (1) inverting
the classification of data items, (2) selection of random data items, and (3) producing
classifications inconsistent with the classifications of other non-corrupted features.
Our goal, once again, is to identify and to filter data items that are suspected to
be corrupted. The first case (i) is demonstrated by Fig. 3, where the raw data items
contain one feature and one vector of labels, where part of the features are totally
non-corrupted and part are suspected to be corrupted (for all samples in this column there is a wrong classification).
Fig. 3 Histogram of original samples with corrupted data inside the normal curve
Method to Bound the Influence of the Corrupted Data Items Our technique is
based on the Random Forest; like the Random Forest algorithm [4] we use decision
trees, where each decision tree that is created depends on the value of a random
vector that represents a set of random columns chosen from the training data.
Large numbers of trees are generated to create a Random Forest. After this forest is created, each instance from the training data set passes through these decision trees. Whenever a data set instance arrives at a tree leaf, its tree classification is compared with its class (.+1 or .−1); when the classification and the class agree, the right variable of the leaf is incremented; otherwise the value of the wrong variable of this leaf is incremented, e.g., 351 instances were classified by Node 5 (leaf): 348 with the right classification and 3 with the wrong classification (Fig. 4).
Certainty Adjustment Due to Byzantine Data Bound The certainty level .ζ of
each leaf can be calculated based on the assumption that the upper bound on
the number of corrupted data items .ξ at every leaf in the tree is known. These
calculations are arrived at using equations 3 and 4, where, .L−1 is the number of
variables (in the leaf) that are labeled as .−1, .L+1 is the number of instances (in the leaf) that are labeled as .+1, and N is the total number of variables that were classified by the leaf.
Fig. 4 Example of a decision tree for predicting the response for the instances in every leaf with right or wrong classification
In the second step, each instance from the test data set passes through these decision
trees to get its classification. Each new tested instance will get a classification result
and a confidence level, where the confidence level is expressed in terms of the (training) right and wrong numbers associated with the leaf in the tree. The final classification is a function of the vector of tuples ⟨classification; right; wrong⟩ with reference to a certainty level, rather than a function of the vector of ⟨classification⟩ which is used in the original Random Forest technique. In this study we show one possibility for using the vector of ⟨classification; right; wrong⟩, though other functions can be used as well to improve the final classification.
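A rough scikit-learn-based sketch of this enhancement (not the authors' implementation; the voting rule below is just one possible function of the ⟨classification; right; wrong⟩ tuples and assumes classes coded as ±1):

```python
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier

def fit_with_leaf_counts(X_train, y_train, **rf_kwargs):
    """Fit a forest, then count right/wrong training classifications per (tree, leaf)."""
    forest = RandomForestClassifier(**rf_kwargs).fit(X_train, y_train)
    leaves = forest.apply(X_train)                       # shape (n_samples, n_trees): leaf indices
    counts = defaultdict(lambda: [0, 0])                 # (tree, leaf) -> [right, wrong]
    for t, tree in enumerate(forest.estimators_):
        correct = tree.predict(X_train) == y_train
        for leaf, ok in zip(leaves[:, t], correct):
            counts[(t, leaf)][0 if ok else 1] += 1
    return forest, counts

def predict_with_certainty(forest, counts, X):
    """Weight each tree's +/-1 vote by the training certainty of the leaf it lands in."""
    leaves = forest.apply(X)
    votes = np.array([tree.predict(X) for tree in forest.estimators_]).T
    out = []
    for row_leaves, row_votes in zip(leaves, votes):
        score = 0.0
        for t, (leaf, vote) in enumerate(zip(row_leaves, row_votes)):
            right, wrong = counts[(t, leaf)]
            score += vote * (right / max(1, right + wrong))
        out.append(1 if score >= 0 else -1)
    return np.array(out)
```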
Algorithm 4: Description of the method for identifying and filtering Byzantine
data for multi-feature data-sets.
We tune down the certainty in each leaf using a given bound on the cor-
rupted/Byzantine data items. The contribution of this part includes a conceptual
improvement of the well known random forest technique; by re-examining all data
items in the data set. The re-examination counts the number of right and wrong
classifications in each leaf of the tree.
In this work we present the development of three methods for dealing with corrupted data in different cases (the details of the experimental results appear in [7]): The first method considers Byzantine data items that were added to a given non-corrupted data set. Batches of uniformly selected data items and the Chernoff bound are used to reveal the distribution parameters of the original data set. The adversary, knowing our machine learning procedure, can choose up to 2% malicious data in the most malicious way; note that there is no requirement for the additional noise to come from a distribution different from that of the data items. We prove that
the use of uniformly chosen batches and the use of Chernoff bound reveals the
parameters of the non-Byzantine data items. We propose to use a certainty level that
takes into account the bounded number of Byzantine data items that may influence
the classification. The third method is designed for the case of several features, some
of which are partly or entirely corrupted. We present an enhanced random forest
technique based on certainty level at the leaves. The enhanced random forest copes
well with corrupted data. We implemented a system and show that ours performs
significantly better than the original random forest both with and without corrupted
data sets; we are certain that it will be used in practice.
In the scope of distributed systems, such as sensor networks, the methods can
withstand malicious data received from a small portion of the sensors, and still
achieve meaningful and useful machine learning results.
References
1. Aslam, J., Decatur, S.: Specification and simulation of statistical query algorithms for efficiency
and noise tolerance. J. Comput. Syst. Sci. 56, 191–2087 (1998)
2. Auer, P.: Learning nested differences in the presence of malicious noise. Theor. Comput. Sci.
185(1), 159–175 (1997)
3. Auer, P., Cesa-Bianchi, N.: On-line learning with malicious noise and the closure algorithm.
Ann. Math. Artif. Intel. 23, 83–99 (1998)
4. Breiman, L.: Random forests, Statistics department. Technical report, University of California,
Berkeley (1999)
5. Cesa-Bianchi, N., Dichterman, E., Fischer, P., Shamir, E., Simon, H.U.: Sample-efficient strategies for learning in the presence of noise. J. ACM 46(5), 684–719 (1999)
6. Decatur, S.: Statistical queries and faulty PAC oracles. In: Proc. Sixth Work. on Comp.
Learning Theory, pp. 262–268 (1993)
7. Dolev, S., Leshem, G., Yagel, R.: Purifying data by machine learning with certainty levels.
Technical Report August 2009, Dept. of Computer Science, Ben-Gurion University of the
Negev (TR-09-06)
8. Kearns, M., Li, M.: Learning in the presence of malicious errors. SIAM J. Comput. 22(4),
807–837 (1993)
9. Mansour, Y., Parnas, M.: Learning conjunctions with noise under product distributions. Inf.
Proc. Lett. 68(4), 189–196 (1998)
10. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
11. Quinlan, J.R.: Induction of decision trees. In: Machine Learning (1986)
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Burling-
ton (1993)
13. Servedio, A.R.: Smooth boosting and learning with malicious noise. J. Mach. Learn. Res. (4),
633–648 (2003)
14. Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
On Impact of Data Models
on Predictability Assessment of Time
Series
Sergey Frenkel
1 Introduction
S. Frenkel ()
Federal Research Center "Computer Science and Control" of RAS, Moscow, Russia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. Goldengorin, S. Kuznetsov (eds.), Data Analysis and Optimization, Springer
Optimization and Its Applications 202,
https://doi.org/10.1007/978-3-031-31654-8_7
models, and the main hopes for getting a suitable result are placed on computing
power.
In a more general sense, in many modern tools based on neural networks (NN),
the formation of a forecast of data values at points in time in the future is performed
as a search for internal patterns and relationships without using any classical
mathematical procedures (Markov models or other probability theory models) or
underlying formal theories. Training is carried out on examples, and not on a
mathematical model, although using various mathematical tools, such as linear or
logistic regression, at intermediate stages – for example, algorithms based on search
trees and boosting (XGB, etc.) [3]. These circumstances are often supplemented by
the opacity of the transformation of the input data (i.e., the construction of features,
for example, in deep learning algorithms [4]), which makes it difficult to a priori
assess the impact of certain input data properties on the prediction efficacy. As the
analysis shows, this can lead to incorrect forecasts, to the inefficiency of the tools
and procedures of the forecasting subsystems as part of information and computing
systems.
Therefore, it is important to make a preliminary assessment of how efficient a
particular forecasting tool is from the point of view of the designer of a specific
software forecasting tool (“Prediction instrument” – PI). Hereinafter, PI refers to
software implementations of prediction algorithms.
Most research on network traffic forecasting has focused on classical statistical
methods that rely heavily on the use of past (“historical”) data, and the time spent
on obtaining a forecast depends largely on the computational complexity of the
algorithm relative to the volume of this history.
In particular, in recent years there has been a significant increase in the number
of studies using computational approaches, including machine learning methods,
to predict various activities in complex computing systems – from user response
on websites for various purposes, to traffic through it and/or workload prediction.
However, the existing literature mainly focuses on the analysis of algorithms
implemented in modern software products for solving specific problems, and there is
no comprehensive review to answer many important questions, such as the impact of
the different types of data models used and the specifics of the mathematical models
underlying the basis of the algorithms used [3].
At the same time, most models assume that the more observations are used,
the more accurate (better) the forecast. Therefore, the problem of high accuracy
is associated with the problem of computational costs, and hence the speed. In
cases where prediction time is critical, for example, for real-time network traffic
prediction problem, the task of prediction is to minimize computational costs as
much as possible while maintaining an acceptable level of accuracy (efficiency) of
the predictor.
One way to solve this problem is to take into account and use the properties
of mathematical time series models that model traffic and job flows, which can
significantly affect the prediction. One way to ensure that the abundance of modern
approaches and models is taken into account is to develop a general conceptual
model of the prediction problem, which is actually absent in the modern literature.
This paper offers an overview of the indicated mathematical models for predict-
ing the behavior of complex systems, primarily telecommunication systems.
Our main goal is to analyze the requirements for mathematical models of data
and systems with processes occurring in them, whose behavior is required to be
predicted using certain software tools, taking into account the availability of data on
the functioning and typical ways of user actions with prediction tools.
We propose a certain categorization of modern algorithms and forecasting
methods based on probabilistic models, primarily focusing on current progress in
machine learning methods.
The probabilistic data models used in modern approaches for various prediction problems are considered and analyzed, and the conditions and requirements for prediction models/algorithms are formulated in the framework of probabilistic models for prediction, from the point of view of harmonizing (coordinating) both the data models and the prediction algorithms. Various formal-logical (semantic, ontological) methods [5] remain outside our scope.
A conceptual model of the process of selecting predictors for a specific data set
is proposed and substantiated, on the basis of which a system of recommendations
and a selection scheme can be built.
Various examples of the influence of the characteristics of random sequences
and processes on the accuracy of prediction are considered and analyzed, including
problems of predicting the sign of increments of random series and processes.
We consider from a unified standpoint various prediction models that are
traditionally considered within the framework of various sections of theoretical
computer science [6], probabilistic disciplines [6, 22], and Machine learning [3, 4].
Since there have been many reviews recently on the practical use of various
professional prediction software [3, 4, 9], this paper does not provide a detailed
overview, but only links to sources detailing the relevant software tools.
in others, local estimates are needed. There are prediction algorithms (predictors) that use numerical (measured) data without any assumptions about their statistical and probabilistic properties (like many algorithms that use the concept of a neural network [3]), and there are those that implicitly treat the data as samples from a normal stationary process, or as transitions of a Hidden Markov net (i.e., generated by a Markov source [8]).
In this regard, we fix the following approaches to the problem of prediction
present in the literature:
– as one of the mathematical problems of estimating unknown conditional proba-
bility distributions (“theoretic-probabilistic approach”),
– as one of the mathematical problems of choosing an optimal solution (the decision-making (DM) approach),
– as an algorithmic control problem.
In principle, the practical application of prediction models requires a combination
of these approaches, which in one form or another must be implemented in practice,
although in the literature these approaches exist for the most part isolated. In my
opinion, this is bad both for practice and for teaching students.
In the Probabilistic approach, the task of prediction is as follows:
Suppose that at time t the conditional probabilities γ(x_{t+1} | x_1, x_2, . . . , x_t) (or the probability distribution density P(x_{t+1} | x_1, . . . , x_t)) are estimated for the realization of the process x_1, x_2, . . . , x_t; in other words, we estimate how likely it is that at time t + 1 we will see the value that we predict. It is clear that the more accurate (in the accepted metric) the estimate, the better the forecast. At the next time t + 1, we have to estimate the probabilities γ(x_{t+2} | x_1, . . . , x_{t+1}) (or the density p(x_{t+2} | x_1, . . . , x_{t+1})), and so on. A mathematical model estimating γ(x_{t+1} | x_1, x_2, . . . , x_t), as well as the conditional probability γ() itself, is called a "predictor" in the literature [14].
Along with γ(), predictors are also called software products (for example, some
modules of cloud MS AWS Azure ML, Google Cloud ML, etc.) that are used for
prediction.
In what follows, these software products will be called, as previously (Introduction), "Prediction instruments" (PI), which may be subsystems of intelligent decision support (IDS), and the word "predictor" will be used in both senses with the
necessary explanations, if this does not directly follow from the context.
In the DM formulation, the prediction problem is stated taking into account the requirement for the efficiency (accuracy) of the forecast: for a given space of strategies B, a space of possible predictable values X, and a loss function l(b, x), b ∈ B, x ∈ X, choose the predicted next value x_t = b_t given the knowledge of the past states x_1, . . . , x_{t−1} (from X), so that the total loss from the prediction error would be asymptotically close to the loss obtained by the best fixed strategy known a posteriori after looking at the entire sequence x_1, . . . , x_t; i.e., a well-chosen (predicted) b_t should minimize the possible a posteriori losses.
The strategy is a function that returns an output vector y for each input vector x
according to the conditional distribution function P(y|x) (or γ()) [14, 16].
The third statement of the problem, which we call the “algorithmic model” of
prediction, consists in using various conceptual models for obtaining a prediction
that do not explicitly contain any basic elements of mathematical models of the
above two mathematical formulations of the prediction problem. In terms of content,
conceptual models underlie the input language for describing the task of obtaining a
forecast (more details below), in particular, the parameters of called functions from
certain machine learning libraries [9, 17, 50], which can specify the structure of the
predictor used. For example, a search tree in eXtreme Gradient Boosting Regressor
aka XGBRegressor, if the user conceptual model considers the forecasting process
as a search for an acceptable (by the selected performance criterion) prediction
among a set of possible values in a fixed data area, or as a structure of a neural
network (number of layers, excitation functions etc.), for example, using the LSTM
model. XGB, from the point of view of the user of the prediction tool, is a gradient
boosting machine that reads some dataset, applies to it various methods of predicting
(not necessarily using any probabilistic models) the future values of the time series,
using an iterative procedure to adaptively change the samples from the training data,
trying different ways of predicting the results of evaluating misclassified records.
This can be seen as incremental addition of prediction methods (referred to as
“prediction models”) until further improvement is made. These “prediction models”
can be developed based on any of the above two classes of mathematical models,
but they are not present in the description (input language) of the algorithmic model
in any way – from the point of view of XGBRegressor, these are just more or less
successful predictors (oracles).
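For illustration only, here is a sliding-window use of XGBRegressor of the kind described; the window length, hyperparameters and toy series are arbitrary choices, not settings prescribed by this text.

```python
import numpy as np
from xgboost import XGBRegressor

def make_windows(series, window=10):
    """Turn a 1-D series into (lagged-window, next-value) pairs."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

series = np.sin(np.linspace(0, 40, 500)) + 0.1 * np.random.default_rng(0).normal(size=500)
X, y = make_windows(series)
split = int(0.8 * len(X))
model = XGBRegressor(n_estimators=200, max_depth=3).fit(X[:split], y[:split])
pred = model.predict(X[split:])   # one-step-ahead forecasts on the held-out part
```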
There are dozens of other algorithmic models that operate, from the user's point of view, as a search engine for the best prediction, for example, using linear regression, support vector machines (SVMs), artificial neural networks, etc.
In order to have a tool for describing the options for using these approaches to
solving prediction problems, we define a conceptual model of prediction algorithms,
which must be consistent with a mathematical (probabilistic!) model, i.e. provide a
representation of all the necessary elements of the subject area of a given prediction
problem in a mathematical model of the predictor and it allows a priori assessment
of its effectiveness (one or another measure of prediction error, for example). An
example (and source) of possible “inconsistency” for the binary sequence predictor
model is the fact that according to [14]:
for any predictor, there is a stationary and ergodic source such that the error |P(x_{t+1} | x_1 . . . x_t) − P̂(x_{t+1} | x_1 . . . x_t)| does not go to 0 as the length t of the observed sequence goes to infinity (here P() and P̂() are the true probability distribution and its estimate, respectively).
This means that for any binary sequence {x_t} that models this or that data, and for any predictor of binary data, it may turn out that the predictor chosen for prediction does not guarantee that the error in estimating the probability distribution of the predicted value x*_{t+1} with respect to the distribution of the true value x_{t+1} converges to 0 as the length of the training sequence increases.
In this sense, this predictor is inconsistent with the data model as a random sequence {x_t} (say, Markovian), and to overcome this inconsistency, the description of the CM must include the concepts needed to describe the probabilistic model that explicitly or implicitly underlies the software implementation of the predictor. This means that, at a minimum, one should always have a set of predictors, in the hope that among them there will be one for which the specified property does not hold (the difference between the specified distributions will converge to zero relatively quickly), and the conceptual model should include a description of the choice (enumeration) of predictors.
Another example of possible inconsistency between data models and the mathematical models of predictors is the presence of a non-linear relationship between past and future values in the data (for example, self-similarity [15]). If this circumstance, i.e., a possible mathematical model of the data, is not represented in the conceptual model of the proposed prediction method, then there is an obvious uncertainty in the choice of an effective algorithm from the available toolbox.
So, the conceptual model in our understanding should be the top-level description of a software product that performs the prediction of the "future" from known data, which we call the Predictor. Within the framework of the ML paradigm, we consider that the predictor is first trained and then, based on this training, forms a forecast. Training is performed on some sequence, and the goal is to tune the parameters of the model implemented by this program to predict the future (not yet observed) part of the sequence. In practice, this may consist not only in tuning the parameters of one particular model, but also in choosing a model set, a predictor, and/or training samples from a certain region, the goal being to predict new sequences from the same region as accurately as possible.
The conceptual model can then be written as the triple
CM = {M_θ^D(S), M_L, F_LM},  (1)
where M_θ^D(S) is a prediction model defined by the data type D (binary, real), with a probability measure θ on D and a structure S on D, which refers to the way data of type D are structured as predictor input variables (for example, the splitting of data into training and test choices, or scalar versus vector data, etc. [9, 11, 50]);
M_L is a model of loss ("penalty") from the received prediction, with a loss function L(X_S, Y_S), where X_S, Y_S ∈ D are the observed (X_S) and predicted (Y_S) data with structure S, given by some ratio on D × D (e.g., the "success rate", the ratio of the number of successful predictions to the total number of predictions), related to a measure of predictability [8, 14, 21] and prediction efficiency;
F_LM = {f_1, ..., f_m} is a set of predictors based on the models M_θ^D(S) and M_L, i.e. consisting of predictors in the sense defined above: methods for estimating the future value via the conditional probability distribution of the predicted value.
We assume that the predictor f_p always corresponds to an admissible prediction model, i.e. all its input data and parameters can be uniquely determined within the framework of the model under consideration at the given moment, and are therefore consistent in the sense indicated above (as introduced in Sect. 3.1 for the concept of CM).
and in the case Prob(0) = Prob(1) = 1/2 the prediction is not performed.
In this case, as it is easy to see, the average loss equals the error probability: 1 − p when p > 1/2, and p if "0" is considered a "success", which can be expressed as
L_p = min(p, 1 − p).  (3)
For example, if p = Prob(0) = 0.7, always predicting the more likely symbol gives an expected loss of min(0.7, 0.3) = 0.3.
Instead of a single predictor, we can consider the actions of some subset of the F_LM predictors.
In this case, the conceptual model for binary data (D = {0,1}^n) can generate the well-known "multi-expert" prediction scheme, in which binary prediction is viewed as a game between the environment and the predictors (whose goal is to minimize the loss of expertise) that fulfil the prediction.
Each expert F is a sequence of functions F_t: {0,1}^{t−1} → [0,1], t ≥ 1, i.e. expertise is a way of setting a probability distribution on the set of sequences {0,1}^{t−1}. Each expert defines a forecast strategy as follows: after observing the first t−1 bits y_1, ..., y_{t−1}, expert F predicts that the next bit is 1 with probability F_t(y_1, ..., y_{t−1}) [16].
It is easy to see that the content of the conceptual model of this approach does not differ from the previously considered view of prediction as an estimate of the conditional distributions of certain events associated with a random binary sequence (with possible differences in the prediction-efficiency estimates used).
Let us now consider what requirements should be met by the mathematical models of data in specific prediction problems, for example, traffic in telecommunication networks, so that it is possible to ensure and control the prediction efficiency at a given step (cycle) of prediction, i.e. to ensure consistency with the properties of the input data and with the criterion for deciding on the effectiveness of the considered algorithmic prediction model.
As is easy to see from (1), the first issue that should be taken into account when choosing or developing mathematical models of predictors is the probability measure used in the particular problem, and hence the probability distribution model considered on the data set D, together with the measure of prediction accuracy determined by the measure M_L (2).
An obvious criterion of forecast accuracy is the distribution of the probability of prediction error (which determines the risk of an incorrect prediction). This obviously depends on the probabilistic data model and, formally, its assessment should be based on knowledge of the probability distributions of the data and on models of the relationship between the data distribution and the probability of a correct forecast. Usually, however, these distributions are unknown.
From a formal point of view, an unknown distribution of the predicted data makes an a priori estimate of the accuracy (uncertainty) of prediction with a given algorithm impossible; this leads either to the requirement to assess accuracy during the prediction process, or to the use of Bayesian methods [17] for the desired conditional distribution p(y|x) of the predicted x given the known data y, which is also associated with technical difficulties.
This indicates the need for a theoretical framework that allows traffic prediction problems to rely not on exact knowledge of the distributions, but on knowledge of general properties of distributions of a certain class. However, the ability to overcome this problem depends on the data type D.
This class of problems can be of interest both in itself and in solving problems of
predicting the sign of increments of discrete or continuous random processes (see
Sect. 5).
So, for discrete data from some finite alphabet A, one of the theoretical approaches to overcoming the lack of exact knowledge of the data distribution is to use the concept of an individual sequence, which is considered as existing in a single instance, and not as a sample trajectory from an ensemble, as is customary in the theory of random processes. It is assumed to be generated by a random source, usually a stationary ergodic one.
A natural question is: can one predict the next values of an individual sequence by estimating a probability distribution from its past, for example, from the proportion of correct predictions up to time t, and then minimizing the expected loss under this assignment (i.e., acting as if future events actually happen with the estimated probability)? The answer is yes, if one uses the so-called "universal scheme" predictor [14].
A well-known and widely used example of a universal measure for symbolic (in particular, binary) sequences, which expresses the conditional probability Prob(x_{t+1} | x_1, ..., x_t), is built on the basis of a universal code U whose redundancy is asymptotically minimal for the classes of Bernoulli and Markov sources [14] (the most famous example is the code obtained by applying the Lempel-Ziv (LZ) compression algorithms [14]).
According to the well-known result of [14], let U be a universal code for some set Ω of sources generating letters from the alphabet A. Then, under the assumption of equiprobable generation of strings by some (hypothetical) data source, the measure μ_U of each word v over the alphabet A, and with it the conditional probability Prob(x_{t+1} | x_1, ..., x_t), can be expressed as
μ_U(v) = 2^{−|U(v)|} / Σ_{u ∈ A^{|v|}} 2^{−|U(u)|},
where v is the considered string x_1, ..., x_t, U is the given universal code (e.g., LZ), which compresses strings to a length close to what their entropy allows, |U(·)| denotes the code length, and A^{|v|} denotes the set of all strings of length |v| over A.
Then the measure μ_U is universal on Ω, i.e., the next value of any sequence generated by a source from Ω, about which there is reason to assume that it is, say, a Markov sequence, can be predicted with close-to-minimal error without knowledge of the exact distribution parameters.
Note that the above LZ algorithm [14], which performs lossless compression of character sequences, can at the same time act as a predictor, determining x_{t+1} as the corresponding leaf of the partial-match tree [14] with the conditional probability induced by the stepwise parsing of the algorithm. (Recall that one well-known criterion of randomness of a binary sequence is that it cannot be compressed below a length corresponding to one unit of entropy per symbol [18].)
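As a rough illustration of this compression-based view of prediction, the sketch below scores the two possible continuations of a binary string by the length of their zlib-compressed encodings; zlib (an LZ77-family codec) is used here only as a stand-in for the universal code U, and on short strings the codec overhead makes the estimate noisy, so this is a sketch of the idea rather than the construction of [14].

import zlib

def code_length(bits: str) -> int:
    return len(zlib.compress(bits.encode("ascii"), 9))

def next_bit_scores(history: str):
    # weight each candidate continuation by 2^(-code length), then normalize
    weights = {a: 2.0 ** (-code_length(history + a)) for a in "01"}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

print(next_bit_scores("0101010101010101010101010101"))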
A classical example of such a predictor is the Laplace predictor
γ_L(x_{t+1} = a | x_1, ..., x_t) = (n_a(x_1, ..., x_t) + 1) / (t + |A|),
where n_a(x_1, ..., x_t) is the number of occurrences of the letter a in the sequence x_1, ..., x_t and |A| is the alphabet size.
If A = {0,1}, then with n_1 and n_0 the numbers of ones and zeros in a sample of size t = M, the Laplace predictor assigns the probability (n_1 + 1)/(M + 2) to "1" and (n_0 + 1)/(M + 2) to "0".
It is important to note that the predicted probabilities cannot be equal to zero even if a certain letter did not occur in the word x_1, ..., x_t.
For example, for the sequence 01010 the Laplace predictor gives Prob(x_6 = 0) = (3 + 1)/(5 + 2) = 4/7 and Prob(x_6 = 1) = (2 + 1)/(5 + 2) = 3/7.
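A minimal sketch of the Laplace predictor for binary data, reproducing this worked example:

def laplace_predictor(seq: str, alphabet="01"):
    # probability of each next symbol: (count + 1) / (length + alphabet size)
    t = len(seq)
    return {a: (seq.count(a) + 1) / (t + len(alphabet)) for a in alphabet}

print(laplace_predictor("01010"))  # {'0': 0.571..., '1': 0.428...}, i.e. 4/7 and 3/7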
It was shown in [9] that for any source with probability distribution P that generates independent and identically distributed symbols from the alphabet A, the error of the Laplace predictor satisfies the inequality:
In other words, one can speak of universality in the Cesàro sense with such a measure of error: the predictor γ is universal for the set of sources Ω if this expression equals 0 for any ω ∈ Ω. That is, if we know that the data can be generated by any source from a wide class of (stationary ergodic) sources, then from this equality we can conclude that for any source ω ∈ Ω the value of the predictor γ(x_1, ..., x_t) approaches the probability ω(x_1, ..., x_t), which means that the predictor is universal (in the sense of the Kullback-Leibler (KL) divergence averaged over an interval of length t).
From this equality we can conclude that, in a certain sense, the universal measure μ is a nonparametric estimate of the unknown conditional probability distribution P.
Thus, if the KL-Cesàro estimate is relevant to the prediction problem under consideration (for example, if there is a monotonic relationship between γ(·) and some criterion of prediction quality, say, in terms of the loss function), then θ in the representation {M_θ^D(S), M_L, F_LM} can be taken to be an arbitrary stationary ergodic measure.
As shown in [14], whatever the actual data-generation mechanism, the universal approach (more precisely, an estimate of prediction quality based on it) does not perform much worse than any other possible forecasting method that uses knowledge of the probability distributions of the data.
Thus, using the results of the theory of universal predictors, one can abstract from the parameters of specific distributions in the course of prediction.
To better understand the implications of this result, let us ask: does additional information about the probabilistic properties of the data always improve the prediction score?
A negative answer follows, for example, from the results of [19]. In [19] it was proved that for any finite i.i.d. sequence of binary data, in which each outcome is a "success" or a "failure" (0 or 1), the conditional expectation of the proportion of successes among the results that immediately follow a run of consecutive successes is strictly less than the corresponding conditional probability of success (the conditioning event is the occurrence of at least one run of k consecutive successes within the first n − 1 trials, where n > 3 and 1 < k < n − 1).
Thus, choosing the structure S so that it includes the k previous values of the binary sequence can, in this case, lead to a worse prediction; everything depends on the prediction model.
Correspondingly, attempts to use knowledge about the probabilistic properties of binary sequences do not always lead to better forecasts, and the resulting forecast depends on the accepted criterion (the "optimal Bernoulli" criterion (2) in the considered case).
This suggests that additional information does not always increase the a priori probability of a correct forecast. We speak of the a priori probability because an explicit a posteriori probability is not available when choosing a prediction method (or predictor); it can only be estimated indirectly, from the results of past work.
Let us consider how criteria are handled in the modern practice of ML-based prediction in order to ensure independence from knowledge of the distributions.
As the analysis of the literature shows, another way to avoid the need to estimate the exact distribution laws is to use the criterion of empirical risk minimization (ERM) [20, 51].
The main idea behind ERM is that we cannot know the true "risk" on a particular dataset, because we do not know the true distribution of the data the algorithm will operate on; instead, we can measure its performance on an already known training dataset (the "empirical" risk) and minimize that. Papers on cloud-computing traffic prediction [20] show that the risk-minimization principle used by time-series prediction algorithms affects the accuracy of the algorithms differently in environments with different traffic models. The empirical risk is estimated as:
b̂_t = arg min_{h∈H} R_emp(h),
R_emp(h) = (1/n) Σ_{i=1..n} L(h(x_i), x'_i),
where L(h(x), x') is a loss function that measures how much the prediction value x' = b̂_t obtained under hypothesis h ∈ H differs from the true value x = x_t.
In other words, R_emp is an estimate of the average loss computed from n previous observations. As shown in [21], the proportion of correct predictions averaged over the number of observations is an estimate for the predictor γ(·), and (as is easy to show) this estimate coincides with R_emp for the "0/1" Hamming-type loss function L(a, b) = 0 if a = b, and L(a, b) = 1 otherwise.
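The following sketch (with purely illustrative prediction and observation vectors) shows that the empirical risk under this 0/1 loss is simply one minus the success rate:

def empirical_risk_01(predicted, observed):
    # Hamming-type 0/1 loss: 0 for a correct prediction, 1 for a wrong one
    losses = [0 if p == o else 1 for p, o in zip(predicted, observed)]
    return sum(losses) / len(losses)

observed  = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1, 0, 1]
r_emp = empirical_risk_01(predicted, observed)
print("R_emp =", r_emp, " success rate =", 1 - r_emp)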
Suppose we have two random processes X(t) and Y(t), defined on an ordered real domain R, with associated probability functions P and Q on the same result set. We say that the two processes are causally similar [52] if:
In fact, the self-similarity property reflects a nonlinear relationship between the past and the future.
The point is that, for example, the self-similar behaviour of traffic and its long-term correlation are used as a basis for forecasting, but under the condition that the predictor should not respond to short-term traffic changes. However, there are situations where these short-term data are important, for example, at the beginning of a DDoS attack [24].
At the same time, self-similarity implies a nonlinear relationship between the present and the future, since the repeatability of a form at different scales cannot be expressed by a linear transformation: a transformation in time is associated with the appearance of new harmonics in the spectral representation of the corresponding random process.
If the discrete spectrum of a random process (time series) x(t) contains only one spectral harmonic of frequency ω, and at some future moment t + k a harmonic 2ω (a "faster" component) appears, then this can be expressed only by a non-linear transformation of the values x(t + k − m), ..., x(t + k − 1), m < k. This circumstance can also make the use of a very large volume of observations inefficient.
From a more general point of view, self-similarity is a term from fractal theory [25], describing objects that look the same regardless of scale: local signal patterns are repeated many times over the whole time series on different (usually small) time scales, so the original set can be reproduced from a smaller portion of it at a suitable magnification. (Intuitively, this property means the presence of a past-future connection structure that could increase the possibility of correct prediction.) For example, in a number of methods [24] the destruction of self-similarity is a predictive feature: the methods are based on the observation that the presence of a DDoS attack reduces the degree of self-similarity of normal traffic, since DDoS tools do not generate self-similar traffic, and this is reflected in the aggregate traffic.
In the practice of telecommunication networks, self-similarity can be caused by so-called "elephant" connections [27], which generate a continuous stream with an extremely large total size in bytes, created, for example, by a TCP stream or other network-channel protocols. These streams can occupy a very large share of the total bandwidth over a period of time. Thus, it is they that determine the state of the traffic during this period (say, the volume of transmitted packets per second, or packet delay), and it is this period that shapes the structure of the traffic at this time.
Note that self-similarity is also called scale invariance: a curve F(x) is scale invariant under the transformation x → bx, y → ay if [28, 45]
F(bx) = aF(x) = b^H F(x),
where the exponent H = log(a)/log(b) is called the Hurst exponent (also known as the Hurst parameter, or simply H). The Hurst exponent determines the time separating correlated (for example, more strongly than a certain threshold value) samples of a random process (e.g., traffic) from each other (more details below).
The Hurst exponent is defined via the rescaled-range statistic:
E[R(n)/S(n)] = C·n^H as n → ∞,
where R(n) is the range of the first n cumulative deviations from the mean, S(n) is the standard deviation of the first n values, E[·] denotes the expected value, n is the time span of the observation (the number of data points in the time series), and C is a constant.
Practically, in nature there is no limit to time, so H is non-deterministic and can only be estimated from the observed data. The calculation of H is considered in detail in [28].
This value indicates the degree of self-similarity present in the time series: the smaller the time shift between identical pairs of values in the series, the larger H is. (Conversely, the greater the delay between two identical pairs of values in the time series, the smaller the Hurst exponent.)
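A rough rescaled-range (R/S) estimate of H can be sketched as follows; the window sizes and the white-noise test series are illustrative, and practical estimators (see [28]) apply further corrections.

import numpy as np

def hurst_rs(x, window_sizes=(8, 16, 32, 64, 128)):
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):
            w = x[start:start + n]
            z = np.cumsum(w - w.mean())   # cumulative deviations from the window mean
            r = z.max() - z.min()         # range R(n)
            s = w.std()                   # standard deviation S(n)
            if s > 0:
                rs_vals.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_vals)))
    # E[R(n)/S(n)] ~ C * n^H, so H is the slope of log(R/S) versus log(n)
    return np.polyfit(log_n, log_rs, 1)[0]

print("H for white noise:", round(hurst_rs(np.random.randn(4096)), 2))  # expected near 0.5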
The presence of self-similarity raises the question of whether it can be used to increase the efficiency of predictors.
Therefore, let us briefly consider well-known approaches to mathematical mod-
els of time series with self-similarity.
1. Poisson models
Speaking about the possibility of using the Poisson flow model as a means of modeling self-similarity, we note that in [29] it is shown that modeling network traffic with a Poisson model, under the assumption that the packet process will smooth out by averaging over a long time scale, may be incorrect, precisely because network traffic exhibits long-range dependence. At the same time, uniform sampling from heavy-tailed distributions can produce poor estimates, since a relatively small number of samples can seriously affect the final estimates.
The slow decay of the variance of a random process (e.g., of the number of arriving packets) as the aggregation scale of self-similar traffic increases is in stark contrast to the mathematical structure provided by Poisson modeling, in which the variance of the arrival process decays as the square root of the scale measure.
Therefore, less traditional models and properties of random processes are considered for series with self-similarity.
2. Long-Range Dependence model
If a random process, for example network traffic, has the property of long-range (extended in time) dependence [23], in which correlation is preserved between events sufficiently remote in time, then one speaks of a Long-Range Dependence (LRD) model.
In terms of correlations, a time series X_t, t ∈ Z, is called long-range dependent if its covariance function decays as a power law [25, 26, 27, 28]:
r(k) ~ c·k^{2H−2} as k → ∞, 1/2 < H < 1,
where c > 0 is a constant.
The larger H is, the stronger the time dependence, because the covariance function decays more slowly at infinity (i.e., the correlation of events separated in time is preserved for large H); the decay of the autocorrelation function is much slower than the exponential convergence typical of (short-range) classical models such as autoregressive moving averages (i.e. of data represented by a process with short-range autocorrelation properties).
For processes without LRD there is no such dependence (H = 0.5), and the autocorrelation r_k = 0 for lags k ≥ 1.
A time series describing traffic becomes self-similar due to the simultaneous influence of many sources (for example, as in [24] above), and the aggregate multi-source traffic (throughput traffic), with heavy-tailed distributions of the time intervals during which a source is active or inactive, can be approximated by Fractional Brownian Motion (FBM), as shown in [25, 27, 45]. Therefore, FBM is considered a natural tool for modeling the phenomenon of self-similarity.
3. Fractional Brownian Motion model
Fractional Brownian Motion (FBM) with Hurst parameter H is a continuous Gaussian process B_H(t) with covariance function E[B_H(t)B_H(s)] = (1/2)(|t|^{2H} + |s|^{2H} − |t − s|^{2H}); for H > 1/2 this formally implies the LRD property. In contrast to classical Brownian motion, the increments of FBM (the difference process) need not be independent, and knowledge of the law of this dependence (for example, the conditional distributions indicated above) could in some cases improve the prediction, treated as prediction of a process with continuous time on [0, T] that has zero (conditional!) expectation for all t in [0, T] (see the possible use of this property for prediction in Sect. 5 of this paper).
Note that, in terms of using process properties for prediction, Brownian Motion (BM) without "fractionality" is a motion in which the process value changes by random increments over time, and the prediction is determined by this randomness. At the same time, in such a process there is a certain "memory" in the sense of a dependence between the past and the future.
From a formal point of view, BM is the integral of white noise. These motions define paths that are random but (statistically) self-similar, i.e. an approximate trajectory (section) of the particle's motion ("outlining" a realization of the random process) resembles the entire path, and this gives an intuitively understandable possibility of prediction. But this requires a model that contains some description of the structure of the fractals, and not just a probabilistic model of the increments.
Section 5 will show how measurable properties of the autocorrelation function can be used to make a forecast.
So, the Hurst exponent can clearly indicate whether the time series behaves as a pure random walk or has some correlation structure [22, 25, 29]. For example, in the field of the Internet of Things [31], the prediction model may include such a parameter for network traffic as a priori knowledge. Since the self-similarity property is well interpreted by IT specialists in terms of describing traffic, its use in conjunction with the features of a deep network increases the interpretability of the model, namely, the ability to understand how data properties, in particular self-similarity, affect the quality of the forecast.
Let us now consider another important property of the data modeled by a random process, namely stationarity.
It is intuitively clear that stationarity of a random process in the narrow sense is more favorable for forecasting than non-stationarity, if only because the joint probability distribution W(x_{t+1}, x_1, ..., x_t), which determines the conditional probability of the predictor γ(x_{t+1} | x_1, ..., x_t), does not depend on time.
Most real processes of interest, however, are not, strictly speaking, stationary.
For example, this in most cases concerns network traffic, where non-stationarity can be associated with jumps over a wide range of time scales; many publications show [25, 27] that network traffic does not, in principle, have stationary behavior, due also to the presence of time cycles (daily, weekly), and can easily be affected by network reconfigurations (for example, manual reconfiguration or dynamic routing changes), communication failures, and the deployment of new machines and applications.
Obviously, for similar reasons, time series describing other real processes, such as electricity consumption, can behave in the same way.
In other words, one can speak confidently about stationarity only to a certain extent.
To estimate this "degree of confidence", statistical tests for stationarity are well known [32]; however, it is difficult to include their results in the prediction model, since apart from the intuitive qualitative argument that stationary processes are better predicted than non-stationary ones, nothing else can be used in the model.
In [31], predictors based on short-term correlations, which are typical for problems of estimation, classification, and prediction of non-stationary processes, are compared, and it is investigated whether it is useful to include long-range dependence, which, as noted, is associated with the self-similarity property, in the prediction model. The conclusion is that, first of all, short-term correlations dominate the contribution to predictor performance, i.e. to the time needed to compute predicted values that are satisfactory in terms of the applied criterion. As a consequence, linear prediction with a relatively short correlation structure is sufficient for prediction applications, rather than the long-term correlations that exist between the future values of the time series (e.g., the traffic level) and the remembered history. This is the case, for example, when there are very many short connections, sometimes called "mice" flows [27].
In [28] it is shown that relatively short observations can be considered stationary, at least in variance, because under self-similarity the variance of the sample mean decreases more slowly than the reciprocal of the sample size m: Var(X^(m)) ~ a²·m^{−β} with 0 < β < 1, where a² is a positive constant (slowly decaying variances). It is also significant that the autocorrelations decay as hyperbolic functions, and not exponentially fast.
Accordingly, with a certain degree of conventionality, one can also speak of stationarity in terms of autocorrelation functions.
Hurst Exponent as a Measure of Stationarity In [31, 33, 34] it is shown that the Hurst exponent H can act as a stationarity criterion for a self-similar process. Let us point out its connection with conventional characteristics of mathematical statistics and the theory of random processes and, accordingly, with predictability.
Values H > 1 indicate non-stationarity.
For a stationary self-similar process, H ∈ (0.5, 1). The closer the value of the Hurst parameter is to 1, the more slowly the variance decays as the time scale increases, and the traffic is said to become more bursty, and therefore non-stationary.
Let us consider how these properties of LRD (self-similar) series affect the technique for solving various prediction problems. If the task is to predict the next values according to the root-mean-square criterion, and there is reason to consider the series stationary (according to H), then it makes sense to compare with the Kolmogorov-Wiener optimal prediction criterion.
First of all, it is known that the root-mean-square error of the simplest linear forecast for a stationary time series x(t) is [35]
ε = (1 − ρ²)·var(x),  σ² = E(x(t + m) − X_e)²,
where X_e is the value predicted, for example, by the Kolmogorov linear predictor
X_e = Σ_{i=1..n} a_i·x(t − i),
ρ = ρ(k) is the autocorrelation at lag k, and there is a monotonic relationship between the root-mean-square error of linear prediction ε and the Hurst exponent on the interval [0.5, 1].
This analysis can be useful when using predictive software tools such as the autoregressive (AR) model, the moving-average (MA) model, the autoregressive moving-average (ARMA) model, and the autoregressive integrated moving-average (ARIMA) model; see, for example, [36].
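For example, a standard ARIMA implementation can be applied along the lines of the sketch below; the toy series and the (p, d, q) order are placeholders, and the statsmodels package is assumed to be available.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500)) + 0.05 * np.arange(500)  # toy nonstationary series
fitted = ARIMA(series, order=(1, 1, 1)).fit()   # AR(1), one differencing step, MA(1)
print(fitted.forecast(steps=10))                # predict the next 10 values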
In the case of H > 1, prediction methods for stationary series become inefficient: unexpected bursts appear whose magnitude is much larger than expected from traditional models, which is consistent with non-stationarity.
It is important, however, that even when the initial process B is nonstationary, fractional Brownian motion has stationary, though dependent, increments. The autocovariance of the increment process shows that the increments are positively correlated if H ∈ (1/2, 1), uncorrelated if H = 1/2, and negatively correlated if H ∈ (0, 1/2).
This property turns out to be important for predicting the sign of increments and will be discussed in Sect. 5.
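This sign behaviour can be checked directly from the standard autocovariance of the FBM increment process (fractional Gaussian noise), ρ(k) = 0.5(|k + 1|^{2H} − 2|k|^{2H} + |k − 1|^{2H}); the short sketch below evaluates the lag-1 value for several Hurst exponents.

def fgn_autocorr(k: int, H: float) -> float:
    # autocovariance of unit-variance fractional Gaussian noise at lag k
    return 0.5 * (abs(k + 1) ** (2 * H) - 2 * abs(k) ** (2 * H) + abs(k - 1) ** (2 * H))

for H in (0.3, 0.5, 0.7):
    print(f"H = {H}: rho(1) = {fgn_autocorr(1, H):+.3f}")
# H = 0.3 gives a negative value, H = 0.5 gives zero, H = 0.7 gives a positive value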
Note that most statistical forecasting methods are based on the assumption that the time series can be made approximately stationary ("stationarized") by differencing the data over time, so that instead of considering the index directly we consider the difference between successive time steps.
The ability to predict the differenced data, rather than the data directly, is, however, a much more significant indicator of the model's predictive power. Indeed, prediction of a pure random walk is impossible in principle, but a success rate SR > 0.5 may appear simply because neighboring values change only slightly and the accuracy criterion is insensitive to these changes (in a large percentage of the observed cases).
For a stationary process, the MSE linear prediction theory based on the Kolmogorov-Wiener model assumes that the time series has a finite mean and variance. However, for time series with LRD properties this cannot always be assumed. The point is that many real data streams formed by superimposing data from several sources are self-similar and have the LRD property, and at the same time follow a Pareto distribution [24] with density
f(x) = b·a^b / x^{b+1}, x ≥ a > 0, b > 0.
The mean and variance of x_t that follow are, respectively, E(x_t) = a·b/(b − 1) (finite only for b > 1) and Var(x_t) = a²·b/((b − 1)²·(b − 2)) (finite only for b > 2).
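The practical consequence of such heavy tails can be illustrated with the sketch below (assuming scipy is available; the scale a and shape b are arbitrary illustrative values): for b ≤ 2 the theoretical variance is infinite, so sample variance estimates typically fail to stabilize as more data arrive.

import numpy as np
from scipy.stats import pareto

a, b = 1.0, 1.5                                   # scale and shape; b <= 2 means infinite variance
samples = pareto(b, scale=a).rvs(size=100_000, random_state=0)
for n in (1_000, 10_000, 100_000):
    print(n, "sample variance:", round(float(samples[:n].var()), 1))  # typically keeps growing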
Deep learning models, such as convolutional neural networks (CNNs), Graph Convolutional Networks (GCN) [40] and recurrent neural networks (RNNs) [41], are also applied to such traffic prediction, in addition to machine learning algorithms such as Support Vector Regression (SVR) [41, 42].
However, a neural network can suffer from overfitting [4], when the model explains well only the examples from the training set, adapting to the training examples instead of learning to classify examples that did not participate in the training. It is easy to see that self-similarity can contribute to this phenomenon. At the same time, such a common way of reducing overfitting as Dropout (switching off some neurons with a certain probability on some interval of the training data) may not work, because training on repeated, self-similar fragments will resemble the previous training.
Also, possible sudden changes in cloud traffic can easily be confused by a neural network with traffic anomalies, which makes training inefficient. Such sudden changes can be interpreted as non-stationarity by calculating the values of H.
Another non-linear time-series forecasting technique that is being tried for traffic forecasting is support vector regression (SVR) [39], which is based on structural risk minimization. However, the choice of suitable kernel functions and optimal parameters is very difficult [43]. Examples are briefly discussed in [44].
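As an illustration only (the kernel and the hyperparameters, which the text notes are hard to choose, are arbitrary here), SVR can be applied to lagged values of a series as follows, assuming scikit-learn is available.

import numpy as np
from sklearn.svm import SVR

series = np.sin(np.linspace(0, 30, 600)) + 0.1 * np.random.randn(600)  # toy traffic-like series
window = 8
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
split = int(0.8 * len(X))
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01)   # kernel and parameters chosen arbitrarily
svr.fit(X[:split], y[:split])
print("test MSE:", float(np.mean((svr.predict(X[split:]) - y[split:]) ** 2)))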
From the above analysis of how the efficiency of prediction models depends on specific properties of time series, it follows that the natural way to take into account the dependence of predictor properties on data behavior (stationarity, nonlinearity of the dependence) is their online selection in the process of solving the network control problem.
This is confirmed in many publications. For example, in [47] it was concluded that, for predicting the response time and throughput of cloud services at different stages of resource use, artificial neural network and linear regression algorithms have different efficiency. In other words, the overall prediction accuracy can be improved by combining different prediction algorithms. However, because of the computational complexity of prediction algorithms, the computational cost of selecting search-tree prediction tools may be unacceptable for problems that use prediction in online network management (for examples with popular predictors, see Sect. 5).
It is important that data represented by a time series differ from other data models in that they carry additional information about the time of occurrence of events, which can be used when building a machine learning model. For example, if the time series is highly correlated in time, then its value at time "t + 1" is likely to be close to the value at time "t", and the model may, in effect, simply use the value at time "t" as its prediction of the value at time "t + 1".
However, if we are talking about predicting a change in the sign of the increment, then this closeness gives nothing significant.
Therefore, let us consider separately the problem of predicting the sign of the increments of random time series and processes.
An example of the usefulness of predicting the sign of traffic increments arises, for example, when attacks on a network are carried out in a way that cannot be detected by antivirus software, and analysts have to rely on analyzing changes in network traffic (traffic volume) or the direction of changes in file-write intensities [46].
In some cases, trend analysis of traffic is used not only because this characteristic is useful in itself, but also because in this case the forecast is more accurate than for the absolute values of the predicted traffic characteristics (for example, the traffic volume per unit time) [47].
Consider the problem of predicting the trend sign of the time series {x_i} via the incremental sequence Δ_{t+1} = x_{t+1} − x_t. In this case Δ_{t+1} is a centered value, with a sample mean close to zero over finite segments of x. As is known from the theory of random processes, the correlation between successive values of Δ_t turns out to be much weaker [22] than in the original sequence {x_t}. At first glance, this may indicate a worse predictability of the increments compared to the predictability of the original values x_t. This, however, concerns the values of the increments, not their sign. Moreover, since the sign of the increment indicates the direction of change, the sign of the residual autocorrelation of neighboring values is an important characteristic: from a practical point of view, a negative sign of the autocorrelation ("negative correlation") means a tendency towards a multidirectional trend in neighboring sections of the sequence, which, obviously, can be used to predict the sign of the change. Further, we will show that these intuitive assumptions have a certain mathematical justification in the theory of random processes [48].
Definition 1 The change in the sign of the increment of values in random data is predictable (and the sign of the increment is predictable) if E(sign(Δ_{t+1}) | F_t) ≠ E(sign(Δ_{t+1})), where E denotes expectation (with respect to the distribution of the values, in this case Δ_t), and F_t is the part of the probability space on which the random process is considered that corresponds to the observed values x_t, or x_{t−k}, ..., x_t, or, say, some sample estimate of the conditional mean of x_{t+1}, etc. (from a formal point of view, F_t corresponds to the so-called "filtration" in the theory of random processes [30]).
In other words, a sign change is predictable if the probability of its current value ("−" or "+") changes when certain previous events F_t change.
In terms of the forecasting approach adopted in mathematical statistics, this means that we can express the probability of a positive change through the conditional mean
a = E(Δ_{t+1} | t),
where, we repeat, the conditioning (· | t) means calculation over the observed values up to the moment t inclusive.
Therefore, the dynamics of the variation ("volatility", to use the terminology of financial mathematics) will affect the sign forecast in all cases when the conditional mean is not equal to zero. The sign of the increment is then predictable, even if the conditional mean is unpredictable at zero mean.
If the distribution F is skewed, then the sign can be predicted even if the mean is zero: in this case the time-varying skewness can be the determining factor in predicting the sign.
For a non-zero conditional mean (a > 0), even if the distribution is symmetric about the conditional mean and the conditional mean is constant by assumption (unlike the t-dependent variation σ(t + 1 | t)), the sign of the increment is predictable according to the above definition.
So, conclusions about the predictability of the sign can be based only on knowledge of the symmetry of the distribution law of the time series, modeled by a random process with this law, and on the distribution parameters, namely the mean, variance, skewness, etc.
which follows from the fact that E(y_i | y_i) = y_i and E(y_{i+1} | y_i) = 0 for zero expectation, when all the increments {y_i} are independent.
Accordingly, considering the conditional mean as a sign predictor, the following rule is proposed for predicting the sign of the difference y_{i+1} − y_i:
sign(y_{i+1} − y_i) is predicted as −sign(y_i).  (4)
In [49] it is stated that uncorrelated increments are sufficient for the indicated sign relation to hold. Although uncorrelatedness does not always mean independence, for network traffic this assumption can be accepted. Indeed, in real processes the dependence of time-sequential values often arises from a non-stationary trend, which is eliminated by centering, y_i ≡ x_i − Ex_i; in addition, most real data, such as changes in traffic volumes or changes in the number of requested IP addresses, can somehow be connected with random changes (decrease or increase) of some external factors [50], and can be represented by random-walk processes close to processes with independent increments [22, 30].
Regarding the prediction accuracy of rule (4) (called the SC ("Sign Criterion-based") predictor in [44]), more subtle mathematical reasoning can be given [42].
The share of successful predictions according to (4) for sufficiently long sequences is proposed in [49] to be estimated as
R = 1/2 + (1 − F(0))·F(0),
where F(x) = ∫_{−∞}^{x} P(y) dy and the quantity of interest is Δy = y_{i+1} − y_i.
Out of the possible sign combinations of y_t and y_{t+1}, two lead to a deliberately false decision, namely the cases y_{t+1} > 0, y_t > 0, y_{t+1} > y_t and y_{t+1} < 0, y_t < 0, y_{t+1} < y_t.
Assuming that all combinations are equally probable, we obtain an a priori probability of correct execution of (4) equal to 2/3.
At the same time, there is no reason to believe that, over a given interval, the two combinations leading to incorrect predictions will occur several times more often than those leading to correct ones.
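A short simulation sketch of rule (4) for independent, centered values (an idealized setting; real traffic increments are only approximately of this kind) reproduces the success rate R = 1/2 + F(0)(1 − F(0)) = 0.75 of [49]:

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=200_000)          # centered, independent values
pred_sign = -np.sign(y[:-1])          # rule (4): predict the sign of y[i+1] - y[i] as -sign(y[i])
true_sign = np.sign(np.diff(y))
print("empirical success rate:", round(float(np.mean(pred_sign == true_sign)), 3))  # close to 0.75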
One of the indications that (4) can be used as a predictor is the uncorrelatedness (or an extremely weak correlation) of successive differences of the considered time series (process), with a tendency towards negative correlation between the values y_{i+1} − y_i and y_i, since "anti-correlation" of two random variables means a tendency to change in opposite directions [22, 49].
The point is that if, within a time interval, the data are positively correlated, then changes in a given direction will tend to be followed by future changes in the same direction, and the path will be smoother than a normal Brownian motion. If the data are negatively correlated, then a change in one direction is more likely to be followed by a change in the opposite direction.
The proposed technique of joint use of the continuous and binary sign-prediction models presented here is applied in [44] (where it is called the "SC-0/1 procedure") and compared with the MLP, XGB, and LSTM predictors widely used in modern practice. Experiments show the high efficiency of the proposed approach.
It is essential that rule (4) does not require a long sequence of past observations, which may be inefficient in the case of strong non-stationarity and nonlinearity of the predicted data.
6 Conclusion
As the analysis of the literature shows, many modern prediction tools based on ML principles do not work effectively because of the pronounced nonlinearity and non-stationarity of traffic changes, and because of the possible inadequacy of the assumption that a large number of previous observations is needed. This paper is an attempt at some ordering and categorization of a huge stream of publications on modern methods, techniques, and models for forecasting data of various nature, which should to a certain extent simplify the search for and analysis of those results of the theory of random processes that allow a quick assessment of the predictability of both the absolute values of the data and the signs of their change.
For this, the concept of a conceptual model of algorithms for predicting the state of systems in the subject area, widely used in modern Big Data and Software
References
1. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
2. Predict Traffic of LTE Network, [Online] https://www.kaggle.com/naebolo/predict-traffic-of-
lte-network. Accessed on 26 Oct 2020
3. Chen, A., Law, J., Aibin, M.: A survey on traffic prediction techniques using artificial
intelligence for communication networks. Telecom. 2(4), 518–535 (2021)
4. Zhang, L., Tan, J., Han, D., Zhu, H.: From machine learning to deep learning: Progress in machine intelligence for rational drug discovery. Drug Discov. Today. 22(11), 1680–1685 (2017)
5. Rooba, R., Vallimayil, V.: Semantic aware future page prediction based on domain. Int J. Pure
Appl. Math. 118(9), 911–919 (2018)
6. Ryabko, B.: Compression-based methods for nonparametric prediction and estimation of some
characteristics of time series. IEEE Trans. on Inf. Theory. 55(9), 4309–4315 (2009)
7. Brovelli, M., Sanso, F., Venuti, G.: A discussion on the Wiener–Kolmogorov prediction
principle with easy-to compute and robust variants. J. Geod. 76, 673–683 (2003)
8. Feder, M., Merhav, N.: Universal prediction. IEEE T. Inform. Theory. 44(6), 2124–2147
(1998)
9. Sharma, S..: Activation functions in neural networks (2019). Retrieved from https://
towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
10. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer Science + Business
Media, New York (2009)
11. Fettke, P.: Conceptual Modelling and Artificial Intelligence, Joint Proceedings of Modellierung
Short. Workshop and Tools & Demo Papers Workshop on Models in AI (2020)
12. Introduction to Unified Modeling Language (UML) 3rd INSPIRATION Training, GFA
(December 4–5, 2012)
13. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic real-time systems. In: Proc. 23rd International Conference on Computer Aided Verification (CAV'11), LNCS, vol. 6806, pp. 585–591. Springer (2011)
14. Ryabko, B.: Prediction of random sequences and universal coding. Probl. Inf. Transm. 24(2),
3–14 (1988)
15. Ikharo, A.B., Anyachebelu, K.T., Blamah, N.V., Abanihi, V.: Optimising self-similarity network traffic for better performance. Int. J. Sci. Technol. Res. https://doi.org/10.32628/IJSRST207413164
16. Cesa-Bianchi, N., Lugosi, G.: On prediction of individual sequences. Ann. Stat. 27(6), 1865–
1895 (1999)
17. Yu, P., Kuo, K.-S., Rilee, M.L., Yu, H.: Assessing Deep Neural Networks as Probability Estimators. arXiv:2111.08239v1 [cs.LG] (2021)
18. Ryabko, B., Monarev, V.: Using information theory approach to randomness testing. J. Stat.
Plan. Inference. 133, 95–110 (2005)
19. Miller, J.B., Sanjurjo, A.: Surprised by the Hot Hand Fallacy? A Truth in the Law of Small
Numbers. arXiv:1902.01265v1 [econ.GN] (2019)
20. Nikravesh, A., Ajila, S.A., Lung, C.-H.: An autonomic prediction suite for cloud resource
provisioning. J. Cloud Comput. Adv. Syst. Appl. 6, 3 (2017). https://doi.org/10.1186/s13677-
017-0073-4
21. Frenkel, S.: Theoretical aspects of a priori on-line assessment of data predictability in applied
tasks 5th International Symposium on Cyber Security Cryptology and Machine Learning
CSCML 2021. LNCS. 12716, 187–195 (2021)
22. Coskun, B., Vardar-Acar, C., Demirtas, H.: A Generalized Correlated Random Walk Converging to Fractional Brownian Motion. arXiv:1903.05424v3 (2019)
23. Ming, L., Jia-Yue, L.: On the Predictability of Long-Range Dependent Series. Mathematical
Problems in Engineering Volume (2010). https://doi.org/10.1155/2010/397454
24. Brignoli, D.: DDOS detection based on traffic self-similarity (n.d.). https://ir.canterbury.ac.nz/
bitstream/handle/10092/2105/Thesis_fulltext.pdf;sequence=2
25. Graf, S.: Statistically self-similar fractals. Prob. Th. Rel. Fields. 74, 357–392 (1987)
26. Park, R., Hernández-Campos, F., Le, L., Marron, J., Park, J., Pipiras, V., Smith, F., Smith, L.,
Trovero, M., Zhu, Z.: Long-range dependence analysis of internet traffic. J. Appl. Stat. 38(7),
1407–1433 (2011)
27. Megues, P., Molnar, S.: Analysis of Elephant Users in Broadband Network Traffic. 19th
EUNICE Workshop on Advances in Communication Networking (2013). https://doi.org/
10.1007/978-3-642-40552-5_4
28. Leland, W.E., et al.: On the Self-Similar Nature of Ethernet Traffic (Extended Version), pp.
1–15. IEEE Press, Piscataway (1994)
29. Becchi, M.: From Poisson Processes to Self-Similarity: a Survey of Network Traffic Models (2008). https://www.cse.wustl.edu/~jain/cse567-06/ftp/traffic_models1/index.html
30. Bosq, D., Nguyen, H.: A Course in Stochastic Processes. Stochastic Models and Statistical
Inference, Kluwer, Dordrecht (1996)
31. Pan, C., Wang, Y., Shi, H., Shi, J., Cai, R.: Network traffic prediction incorporating prior
knowledge for an intelligent network. Sensors. 22(7), 2674 (2022)
32. Lavasani, A., Eghlidos, T.: Practical next bit test for evaluating pseudorandom sequences. Comput. Sci. Eng. Electric. Eng. 16(1), 19–33 (2009)
33. Park, C., Hernandez, F., Le, L., Marron, J.S., Park, J., Pipiras, V., Smith, F.D., Smith, R.L.,
Trovero, M., Zhu, Z.: Long range dependence analysis of Internet traffic. Journal of Applied
Statistics. 38, 1407–1433 (2004)
34. He, H., Cheng, S., Zhang, X.: Signal nonstationary degree evaluation method based on moving statistics theory. Shock Vib. 2021, Article ID 5562110 (2021). https://doi.org/10.1155/2021/5562110
35. Box, G.E.P., Jenkins, G.M., Reinsel, G.C.: Time Series Analysis: Forecasting and Control.
Wiley, New York (2008)
36. Lyman, J., Edmonson, W., McCullough, C., Rao, M.: The predictability of continuous-time
band limited processes. IEEE Trans. Signal Process. 48(2), 311–316 (2000)
37. Song, W., Duan, S., Chen, D., Zio, E., Yan, W., Cai, F.: Finite iterative forecasting model based
on fractional generalized Pareto motion. Fractal Fract. 6, 471 (2022)
38. Loiseau, P., Gonçalves, P., Dewaele, G., Borgnat, P., Abry, P., Primet, P.: Investigating self-
similarity and heavy-tailed distributions on a large-scale experimental facility. IEEE/ACM
Trans. netw. 18, 1261–1274 (2010)
39. Chen, A., Law, J., Aibin, M.: A survey on traffic prediction techniques using artificial
intelligence for communication networks. Telecom. 2, 517–536 (2021)
40. Vinchoff, C., Chung, N., Gordon, T., Lyford, L., Aibin, M.: Traffic Prediction in optical
networks using graph convolutional generative adversarial networks. In: In Proceedings of the
International Conference on Transparent Optical Networks, pp. 3–6. Bari, Italy (2020)
41. Aibin, M.: Deep Learning for Cloud Resources Allocation: Long-Short Term Memory in
EONs. In Proceedings of the International Conference on Transparent Optical Networks,
Angers, France, 9–13 July 2019; pp. 8–11
42. Yin, F., Wang, J., Guo, C. (eds.): A Boosting-Based Framework for Self-Similar and Non-linear
Internet Traffic Prediction ISNN 2004, pp. 931–936. LNCS 3174 (2004)
43. Shi, Y., Fernando, B., Hartley, R.: Action Anticipation with RBF Kernelized Feature Mapping
RNN. arXiv:1911.07806v3 [cs.CV] 11 Jul (2021)
44. Frenkel, S.: Predicting the direction of changes in the values of time Series for relatively small
training samples. In: 6th International Symposium on Cyber Security Cryptology and Machine
Learning CSCML 2021CSCML, Beer-Sheva, Israel, pp. 118–134. Proceedings, Lecture Notes
in Computer Science (13301) (2022)
45. Östring, S.A.M., Sirisena, H.: The influence of long-range dependence on traffic prediction. In: Proc. IEEE International Conference on Communications (ICC 2001) (2001)
46. Nikravesh, A.Y., Ajila, S.A., Lung, C.-H.: An autonomic prediction suite for cloud. J. Cloud
Comput. Adv. Syst. Applic. 6, 3 (2017). https://doi.org/10.1186/s13677-017-0073-4
47. Zhao, A., Liu, Y.: Application of Nonlinear Combination Prediction Model for Network Traf-
fic. 2nd International Conference on Electronic & Mechanical Engineering and Information
Technology (EMEIT-2012). Proceedings, 2337–2340 (2012)
48. Christoffersen, P., Diebold, F.: Financial asset returns, direction-of-change forecasting, and
volatility dynamics. Manag. Sci. 52(8), 1273–1287 (2006)
49. Sornette, D., Andersen, J.: Increments of uncorrelated time series can be predicted with a
universal 75% probability of success. Int. J. Mod. Phys. 11(4), 713–720 (2000)
50. Dalmazo, B.L., Vilela, J.P., Curado, M.: Performance analysis of network traffic predictors in the cloud. J. Netw. Syst. Manag. 25, 290–320 (2017)
51. Aryan, M.: Efficient Methods for Large-Scale Empirical Risk Minimization. A Doctoral
Thesis, Philadelphia, PA (2017)
52. Kleeman, R.: Information theory and dynamical system predictability. Entropy. 13, 612–649
(2011)
A Three-Step Method for Audience
Extension in Internet Advertising Using
an Industrial Taxonomy
1 Introduction
Modern technologies transform many areas of human life and modify industries.
The era of Big Data brought on remarkable possibilities for extracting meaningful
insight from the data. In particular, the growth of digital advertising, whereby an increasing share of advertisement moves from traditional formats (such as TV, radio, outdoor) to the Internet, both accelerates the production of raw data and creates the demand for insight generated from it. Digital advertising brings
new ways to investigate potential audiences, create and control advertisements,
and evaluate results. Companies wish to predict consumer preferences, determine
relevant customer segments for particular products or services, and target marketing
offers based on the data. The amount and diversity of data on one hand, and the
rapidly increasing sophistication of methods on the other allow for such predictions
to be carried out with ever increasing accuracy. For example, they allow marketers
to expand on their time-tested ideas of targeting consumers. Whereas in the past
simple regression methods allowed a company to identify a group of people that
were very much like their existing customers, nowadays it becomes possible to go
beyond that to what is now referred to as an audience extension. In general, ability
to derive better, more accurate insight and to extract more information from the data
becomes essential for any business that wishes to remain competitive. The objective
of this paper is to propose a better method of extracting useful user insight from the
D. Frolov
Department of Data Analysis and Artificial Intelligence, HSE University, Moscow, Russia
Z. Taran
Division of Management, Marketing, and Business Administration, Delta State University,
Cleveland, MS, USA
data that the organization is likely already collecting without the need to gather new
information.
more preferable. Of course, this conclusion holds only if the relative weight of an
offshoot is less than the total relative weight of three gaps.
We are interested to see whether a fuzzy set S can be generalized by a node t
from higher ranks of the taxonomy, so that S can be thought of as falling within the
framework covered by the node t. The goal of finding an interpretable pigeon-hole
for S within the taxonomy can be formalized as that of finding one or more “head
subjects” t to cover S with the minimum number of all the elements introduced at the
generalization: head subjects, gaps, and offshoots. This goal realizes the principle
of Maximum Parsimony (MP).
Consider a rooted tree T representing a hierarchical taxonomy so that its nodes are annotated with key phrases signifying various concepts. We denote the set of all its leaves by I. The relationship between nodes in the hierarchy is conventionally expressed using genealogical terms: each node t ∈ T is said to be the parent of the nodes immediately descending from t in T, its children. We use χ(t) to denote the set of children of t. Each interior node t ∈ T − I is assumed to correspond to a concept that generalizes the topics corresponding to the leaves I(t) descending from t, viz. the leaves of the subtree T(t) rooted at t, which is conventionally referred to as the leaf cluster of t.
A fuzzy set on I is a mapping u of I to the non-negative real numbers that assigns a membership value, or support, u(i) ≥ 0 to each i ∈ I. We refer to the set S_u ⊂ I, where S_u = {i ∈ I : u(i) > 0}, as the base of u. In general, no other assumptions are made about the function u, other than, for convenience, commonly limiting it to not exceed unity. Conventional, or crisp, sets correspond to binary membership functions u such that u(i) = 1 if i ∈ S_u and u(i) = 0 otherwise.
Given a fuzzy set u defined on the leaves I of the tree T, one can consider u to be a (possibly noisy) projection of a higher-rank concept, u's "head subject", onto the corresponding leaf cluster. Under this assumption, there should exist a head subject node h among the interior nodes of the tree T such that its leaf cluster I(h) more or less coincides (up to small errors) with S_u. This head subject is the generalization of u to be found. The two types of possible errors associated with the head subject, if it does not cover the base precisely, are false positives and false negatives, referred to in this paper as gaps and offshoots, respectively. They are illustrated in Figs. 2 and 3. Given a head subject node h, a gap is a node t covered by h but not belonging to u, so that u(t) = 0. In contrast, an offshoot is a node t belonging to u, so that u(t) > 0, but not covered by h. Altogether, the total number of head subjects, gaps, and offshoots is to be minimized. A node is called u-irrelevant if its leaf cluster does not overlap the base S_u. Obviously, if a node is u-irrelevant, all of its descendants are also u-irrelevant.
therefore lost) at any of t's ancestors. The algorithm ParGenFS recursively computes H(t), L(t) and p(t) from the corresponding values for the child nodes in χ(t).
Specifically, for each leaf node that is not in S_u, we set both L(·) and H(·) to be empty and the penalty to be zero. For each leaf node that is in S_u, L(·) is set to be empty, whereas H(·) is set to contain just the leaf node, and the penalty is defined as its membership value multiplied by the offshoot penalty weight γ. To compute L(t) and H(t) for any interior node t, we analyze two possible cases: (a) when the head subject has been gained at t and (b) when the head subject has not been gained at t.
In case (a), the sets H(·) and L(·) at its children are not needed. In this case, H(t), L(t) and p(t) are defined by:
H(t) = {t},  L(t) = G(t),  p(t) = u(t) + λV(t).  (2)
In case (b), the sets H(t) and L(t) are just the unions of those of its children, and p(t) is the sum of their penalties:
H(t) = ∪_{w∈χ(t)} H(w),  L(t) = ∪_{w∈χ(t)} L(w),  p(t) = Σ_{w∈χ(t)} p(w).  (3)
To obtain a parsimonious lift, whichever case gives the smaller value of p(t) is chosen.
When both cases give the same value of p(t), we may choose, say, (a). The output of the algorithm consists of the values at the root, namely, H, the set of head subjects and offshoots, L, the set of gaps, and p, the associated penalty.
ParGenFS Algorithm
INPUT: u, T
OUTPUT: H = H(Root), L = L(Root), p = p(Root)
I. Base case: for each leaf i ∈ I of T
    L(i) = ∅
    if u(i) > 0
        H(i) = {i}
        p(i) = γ·u(i)
    else
        H(i) = ∅
        p(i) = 0
II. Recursive step: for each interior node t whose children have all been processed
    p_gain = u(t) + λ·V(t)
    p_nogain = Σ_{w∈χ(t)} p(w)
    if p_gain ≤ p_nogain
        H(t) = {t}
        L(t) = G(t)
        p(t) = p_gain
    else
        H(t) = ∪_{w∈χ(t)} H(w)
        L(t) = ∪_{w∈χ(t)} L(w)
        p(t) = p_nogain
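A minimal Python sketch of this recursion is given below. The interior-node membership u(t), the gap set G(t), its summary penalty V(t), and the weights γ (gamma) and λ (lam) are assumed to be supplied by the caller (their exact definitions belong to the part of the method not reproduced here), so the sketch only illustrates the control flow of equations (2)-(3), not the full method.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

def pargenfs(t, u, G, V, gamma, lam):
    """Return (H, L, p) for the subtree rooted at t."""
    if not t.children:                                    # base case: a leaf
        if u.get(t.name, 0.0) > 0.0:
            return {t.name}, set(), gamma * u[t.name]     # leaf kept as a potential offshoot
        return set(), set(), 0.0
    results = [pargenfs(w, u, G, V, gamma, lam) for w in t.children]
    H_nogain = set().union(*(h for h, _, _ in results))
    L_nogain = set().union(*(l for _, l, _ in results))
    p_nogain = sum(p for _, _, p in results)
    p_gain = u.get(t.name, 0.0) + lam * V(t)              # case (a): head subject gained at t
    if p_gain <= p_nogain:                                # ties resolved in favour of case (a)
        return {t.name}, set(G(t)), p_gain
    return H_nogain, L_nogain, p_nogain                   # case (b): pass children's results up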
Intuitively, the method consists of three steps: (1) computing membership values
for the interest segments for a user by a classifier; (2) performing generalization of
those sets and obtaining high-ranked segments, which is a core part of the method;
(3) obtaining a set of advertising campaigns for a user.
There are three inputs to our method, Developing and Lifting of User Segment
Profile (DLUSP). These are: (i) a large set of internet users from which the audience