An Implementation of the HDBSCAN* Clustering Algorithm
Geoffrey Stewart and Mahmood Al-Khassaweneh
Featured Application: The clustering implementation being presented can be used to discover
clusters and identify outliers in a dataset. This implementation provides a fast prediction feature
that makes it a compelling choice for applications, such as a streaming clustering service.
1. Introduction
Cluster analysis is the task of arranging a set of similar data points into groups. It is an unsupervised machine learning task that can discover patterns or identify related groups from a dataset. There are many different algorithms proposed for cluster analysis [1,2]. The K-Means algorithm is commonly used for clustering since it is fast and easy to understand. However, there are some potential problems with this algorithm. First, the number of clusters, or partitions, needs to be provided as input to the algorithm. The number of clusters that may be present in a new dataset is, however, not always known. Another problem is that K-Means is more of a partitioning algorithm than a clustering algorithm, since it partitions all the data into groups by minimizing distances between data points. These problems also exist in other clustering algorithms. For example, the affinity propagation and spectral clustering algorithms also partition all the data into groups, so even outliers or noisy data points are included in the clusters. Another well-known algorithm is agglomerative clustering, which, like K-Means, requires the number of clusters to be provided as input when a threshold parameter is not used.
The HDBSCAN* clustering algorithm [3] is a density-based algorithm. Unlike K-Means, it does not require that every data point be assigned to a cluster, since it identifies dense clusters. Points not assigned to a cluster are considered outliers, or noise. An algorithm that can effectively find distinct groups in a dataset and identify outliers is a valuable technique. There is an established Python implementation of the HDBSCAN* algorithm [4] which is included as a scikit-learn compatible project. It is well known that Python is a very popular language for data mining and machine learning tasks, such as clustering. Java, on the other hand, is not as popular for these types of tasks, even though there are numerous Java enterprise applications running in production today. Machine learning libraries offer the benefit of providing a consistent API to a variety of learning algorithms and utilities. Tribuo [5] is a recently open-sourced Java machine learning library. Tribuo has proven to be robust and performant, but it currently supports only the K-Means clustering algorithm. Therefore, it is valuable to build an implementation of the HDBSCAN* algorithm in the Tribuo library.
This work offers a new and optimized implementation of the HDBSCAN* algorithm in Java. A full tutorial about this implementation is available at [6]. The proposed implementation leverages Tribuo’s infrastructure to provide a train/predict interface consistent with the library’s other learning algorithms and features. This gives the ability to train models which discover clusters and identify outliers. This work introduces a novel prediction technique which uses a trained model to make predictions for unseen data points natively in Java. Functionality such as this provides the foundation for applications such as a streaming clustering service. Furthermore, the methodology presented in this work illustrates a generic development process which can be reused for the implementation of an existing learning algorithm in an existing machine learning library.
The remainder of this article is structured as follows. A background and literature
review are presented in Section 2. Section 3 provides a description of the methodology
of this work. Section 4 discusses the results of the comparisons between the HDBSCAN*
implementations. Finally, Section 5 presents a summary discussion of the work.
2.2. HDBSCAN*
More recently, the HDBSCAN* algorithm [3] was introduced. It stands for Hierarchical DBSCAN*, where the asterisk refers to DBSCAN*, an improved version of DBSCAN introduced by the same authors [3]. These authors extend their work in [8] by introducing a complete framework
for cluster analysis and outlier detection which includes an enhanced explanation of the
HDBSCAN* algorithm. HDBSCAN* improves on DBSCAN by establishing a hierarchical
representation of the clusters. The hierarchy derived from the execution of the algorithm can
be used very effectively for cluster extraction and outlier detection. HDBSCAN* overcomes
the limitations of DBSCAN by identifying clusters of any density. Although there are
several benefits provided by the HDBSCAN* algorithm, there is one drawback that should
be considered. The algorithm has an overall asymptotic complexity of O(n²). Furthermore, there are multiple sub-steps of the algorithm, which themselves have a complexity of O(n²). This complexity can have a significant impact on the algorithm execution time for large datasets. There are implementations of the K-Means algorithm, mentioned earlier, which have a complexity of O(n). Table 1 summarizes the strengths and weaknesses of the
DBSCAN and HDBSCAN* algorithms.
Hierarchical clustering methods are another form of clustering analysis which can
work without the need to specify the number of clusters. These methods construct clusters
by recursively partitioning the data points in either a top–down or bottom–up approach [10].
Divisive clustering starts with a single cluster containing all points, which is divided into sub-clusters based on some criteria; these sub-clusters are successively divided further. There are different strategies to determine when the process should be terminated. Agglomerative hierarchical clustering starts with each data point in its own cluster and merges clusters that satisfy some criteria, continuing to merge clusters successively. Again, there
are different strategies which can be employed to terminate the process. Hierarchical
clustering methods can provide high interpretability since the results can be presented
in a dendrogram, and the choices for cluster splits or merges are well defined. These
methods do not provide any mechanism for identifying outliers. They also tend to have
high asymptotic complexity and large memory requirements.
In the previous section, it was mentioned that partitioning algorithms such as K-Means require the number of clusters, k, to be defined as input to the algorithm. In some
cases, the value for k is unknown. There are techniques which can be used to determine
the optimal number of clusters from a dataset, such as the frequently mentioned elbow
method. The elbow method is somewhat subjective, and, therefore, unreliable. There are
also more analytical techniques that leverage silhouette analysis to estimate the optimal
number of clusters from a dataset. One such example is an algorithm called k-SCC [11].
Combining the k-SCC algorithm with K-Means effectively achieves a natural discovery of
clusters from a dataset. However, this combination is not able to distinguish outlier points
from the clusters as can be done using HDBSCAN*.
Using the core distances, a new distance metric is computed, called the mutual reachability distance. The mutual reachability distance between two points x and y is the maximum of the core distance of x, the core distance of y, and the distance between x and y. The points (2.7, 2.7) and (3.0, 3.0) shown in Figure 2 have a mutual reachability
distance equal to the distance between the two points. Concretely, for k = 5, the core
distance of (2.7, 2.7) is 0.22, the core distance of (3.0, 3.0) is 0.31 and the distance between
the two points is 0.42. Therefore, the mutual reachability distance between these points
is 0.42.
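To make these definitions concrete, the following minimal Java sketch computes core distances and the mutual reachability distance for a small array of points. It illustrates the definitions only and is not the Tribuo Hdbscan source; in particular, whether a point counts as one of its own k nearest neighbors is a convention assumption made here for simplicity.

```java
import java.util.Arrays;

/** Minimal sketch of the core distance and mutual reachability distance definitions. */
public class MutualReachabilitySketch {

    /** Euclidean distance between two points. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Core distance of point i: the distance to its k-th nearest neighbor. */
    static double coreDistance(double[][] data, int i, int k) {
        double[] dists = new double[data.length];
        for (int j = 0; j < data.length; j++) {
            dists[j] = distance(data[i], data[j]);
        }
        Arrays.sort(dists);
        // dists[0] is the zero distance from the point to itself, so the
        // k-th nearest neighbor sits at index k (a convention assumption).
        return dists[k];
    }

    /** Mutual reachability distance: the maximum of the two core distances and the direct distance. */
    static double mutualReachability(double[][] data, int i, int j, int k) {
        double direct = distance(data[i], data[j]);
        double core = Math.max(coreDistance(data, i, k), coreDistance(data, j, k));
        return Math.max(direct, core);
    }
}
```

For the two points from Figure 2, this computation returns the largest of the two core distances and the direct distance, which matches the value of 0.42 reported above.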
Next, the mutual reachability distances between the points can be used to establish a
weighted graph, where the data points are vertices and an edge between any two points
has the weight of the mutual reachability distance of those points. This complete graph
is only a conceptual artifact in the algorithm, since it is the graph’s minimum spanning
tree which needs to be computed. A minimum spanning tree is a spanning tree whose
sum of edge weights is as small as possible. That is, it has all vertices connected with a
subset of the edges from the complete graph, without any cycles and with the minimum
possible total edge weight. The resulting minimum spanning tree is modified by adding a
self-edge to each vertex with the point’s core distance as the weight. This gives a graph
called the extended minimum spanning tree. Figure 3 shows the minimum spanning tree
of the simple dataset.
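As a sketch of this step, Prim's algorithm can build the minimum spanning tree directly from a matrix of mutual reachability distances without ever materializing the complete graph as an object. The sketch below is illustrative only, under the assumption that all pairwise mutual reachability distances fit in an n × n matrix; it is not the data structure used by the Tribuo or reference implementations.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative Prim's algorithm over an n x n matrix of mutual reachability distances. */
public class PrimMstSketch {

    /** Each MST edge is returned as {source vertex, target vertex, weight}. */
    static List<double[]> buildMst(double[][] weights) {
        int n = weights.length;
        boolean[] inTree = new boolean[n];
        double[] bestWeight = new double[n];
        int[] bestSource = new int[n];
        List<double[]> mst = new ArrayList<>();

        java.util.Arrays.fill(bestWeight, Double.POSITIVE_INFINITY);
        inTree[0] = true;
        for (int j = 1; j < n; j++) {
            bestWeight[j] = weights[0][j];
            bestSource[j] = 0;
        }

        for (int step = 1; step < n; step++) {
            // Select the cheapest vertex not yet in the tree.
            int next = -1;
            for (int j = 0; j < n; j++) {
                if (!inTree[j] && (next == -1 || bestWeight[j] < bestWeight[next])) {
                    next = j;
                }
            }
            inTree[next] = true;
            mst.add(new double[] {bestSource[next], next, bestWeight[next]});

            // Relax the candidate edges from the newly added vertex.
            for (int j = 0; j < n; j++) {
                if (!inTree[j] && weights[next][j] < bestWeight[j]) {
                    bestWeight[j] = weights[next][j];
                    bestSource[j] = next;
                }
            }
        }
        return mst;
    }
}
```

The extension to the extended minimum spanning tree then amounts to appending one self-edge per vertex whose weight is that point's core distance.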
Now, the graph can be used to build the HDBSCAN* hierarchy. To begin with, there is
one cluster label, and all the points are assigned to this cluster. This cluster is added to a
cluster list. The edges of the graph are then sorted by weight in ascending order. Starting from the bottom of this sorted list, i.e., from the largest weights, edges are iteratively removed from the extended minimum spanning tree. Edges with equal weights must be removed simultaneously. The weight value of the edge(s) being removed is used to denote the current hierarchical level. As an edge is
removed, the cluster containing the removed edge is explored. Clusters that have become
disconnected and contain fewer points than s (the minimum cluster size) are all assigned the noise label. A cluster that has become disconnected, but has more than s points, is
assigned a new cluster label. This is called a cluster split. Additionally, the new cluster
is added to the list of clusters. A new hierarchy level is created when an edge removal
has resulted in new clusters due to cluster splits. By the end of the process, in the last
level of the hierarchy, all the points in the dataset will have been assigned to noise. The
hierarchy produced by this process is the HDBSCAN* hierarchy. Figure 4 shows a table
representation of the HDBSCAN* hierarchy of the simple dataset. Each column of the table
represents a data point, and the rows contain the cluster assignments as the algorithm runs.
In the top row, all the points are assigned to the same, single cluster. In the second row,
some points are set to zero because they have been split off and are marked as outliers. In
the third row, there are now two distinct clusters and one additional outlier. By the last
row, all the points are assigned as outliers or noise points. During the construction of the
HDBSCAN* hierarchy, a list of the clusters is also maintained, where each cluster holds a
reference to its parent.
Figure 2. The core distances for the points (2.7, 2.7) and (3.0, 3.0) for k = 5, using a zoomed in view
on a particular area of the simple dataset.
Figure 3. The minimum spanning tree of the simple dataset. An extended minimum spanning
tree would contain self-edges for each point, with the weight of the self-edge being the point’s core
distance value.
With the HDBSCAN* hierarchy and the list of clusters computed, the next step is to
identify the prominent clusters from this hierarchy. To proceed, a new measure needs to be
established that can be used towards determining the stability of a cluster. This is called
lambda, such that λ = 1/edge weight. Further, for a cluster we also define values λbirth
and λdeath to be the lambda value when a new cluster was created from a cluster split, and
the lambda value when that same cluster was itself split, respectively. For each point in a
cluster, we can define the value λ p as the lambda value at which that point dropped out of
the cluster. The stability for a cluster can be computed, as shown in Equation (1).
The stability needs to be propagated through the clusters. Leaf clusters are clusters
with no children, and they can be identified from the list of clusters. Starting with these
leaf clusters, traverse up using the reference to the cluster’s parent. Leaf clusters always propagate their stability to their parents and add themselves as a propagated descendant in the parent cluster. For non-leaf clusters, one of two things will occur. If the cluster
being processed has a higher stability than the cumulative stability of its descendants,
it alone will be propagated to the parent cluster. Otherwise, the cumulative stability of
all the current cluster’s descendants will be propagated to the parent cluster. Clearly,
no propagation occurs for the root cluster since it has no parent. When this process is
finished, the root cluster will contain the references to descendant clusters with the highest
stabilities. The clusters with the highest stabilities are the most prominent clusters. Using
the HDBSCAN* hierarchy together with the details of the most prominent clusters, the
list of cluster assignments for each data point can be quickly generated. This is the most
significant artifact generated as output of the HDBSCAN* algorithm described in [8]. The
prominent clusters identified using the simple dataset are shown in Figure 5. Three clusters
are identified, and the five yellow points are determined to be outliers.
Figure 5. The clustering result produced by HDBSCAN* using the simple dataset.
Another important artifact that can be obtained from the execution of the HDBSCAN* algorithm is the set of outlier scores for the data points. The literature [8] refers to this as a point’s GLOSH, which stands for Global–Local Outlier Score from Hierarchies. Fortunately, computing a data point’s GLOSH is not too complex, but it requires some additional bookkeeping during the construction of the HDBSCAN* hierarchy. Earlier in this section, it was mentioned that the weight value of the edge being removed is used to denote the current hierarchical level. This weight value and the last cluster label must be noted for every point at the moment the point is marked as noise, or assigned the noise label, to use the same wording as before. A data point x has the values e and e_max, which are the weight value of the point right before it was marked as noise, and the lowest propagated child death level from its last labeled cluster, respectively. The value of the outlier score for a point x can be obtained using Equation (2).
GLOSH(x) = 1 − e_max(x) / e(x)    (2)
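As a small illustration (not the library's code), Equation (2) reduces to a one-line computation once e and e_max have been recorded for a point during hierarchy construction; scores close to one indicate stronger outliers.

```java
/**
 * Illustrative GLOSH computation from Equation (2); eMax and e are the
 * bookkeeping values recorded for a point while building the hierarchy.
 */
static double glosh(double eMax, double e) {
    return 1.0 - (eMax / e);
}
```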
This section is summarized by Algorithm 1, which shows the main steps of the HDBSCAN* algorithm.
monly require access to the file system to persist results. Further, I/O can lead to
performance issues;
• No unit tests. The reference implementation has no self-contained unit tests to verify
the algorithm;
•	There is constraints functionality present in the reference implementation which adds complexity to several steps of the algorithm. This feature is not required;
• None of the logic leverages parallelization. There are several blocks of logic with high
asymptotic complexity which could cause performance issues.
Beyond these issues, it should be noted that this reference implementation is not being actively maintained and cannot be easily integrated with an existing Java application. This
highlights the value of developing a new HDBSCAN* implementation as a feature of an
existing machine learning library for Java.
were created in Tribuo. The main benefit it offers is that any of these objects can be
regenerated from scratch, which is valuable for reproducibility. Implementing a new feature
in Tribuo requires some additional considerations to cooperate with the Provenance system.
distributed around a defined number of centroids. This makes it easy to vary the number
of samples, the number of features, and the number of centroids. The generated datasets
are written to .csv files. The reference implementation is executed using a .csv file as input,
and the resulting outputs are used to make assertions in the unit tests. Several different
unit tests are created each with a unique dataset. Once unit tests are established, changes
can be made safely since they can be verified quickly by executing the tests.
been implemented, are identified and analyzed as part of this review. The current version
of Tribuo provides an implementation of K-Means clustering. This specific part of the
project should be carefully reviewed since the new clustering algorithm being added will
likely be quite similar. The results of the analysis of the Tribuo project are captured back in
Section 2, in the Background and Literature Review, since they should be considered as a
component of the background for this work.
to figure out how to load these as Dataset<ClusterID> objects. Recall that a Dataset<ClusterID> object is provided as input to an HdbscanTrainer instance’s train method. It
is a major milestone when the first unit test in the Tribuo Hdbscan module is successfully
executed which makes the same assertions as those from the working, minimal HDBSCAN*
implementation. This indicates that for a specific dataset, the reference implementation
and the Tribuo Hdbscan implementation produce the identical cluster assignments and
outlier scores. Further comparisons between these two implementations are performed in
the upcoming Compare the HDBSCAN* implementations section.
The inputs to the algorithm are the dataset and a list of cluster assignments. First, the
cluster assignments are used to create a map, where the key of this map is the cluster label,
and the value is a collection of the points assigned to this cluster label. The collection
of points is sorted by a point’s outlier score. Next, the number of exemplars must be
established. Let l be the size of the dataset and let c represent the number of identified
clusters, for a given dataset and its corresponding HDBSCAN* result. The exemplar number
formula is shown in Equation (3).
exemplar number = ⌊√(l/2) + c⌋    (3)
There are some important details to mention about this formula. The square root of half of l allows the number of exemplars to grow as the size of the dataset increases, but the growth slows for larger datasets. Adding the value of c ensures there are always at least as many exemplars as there are clusters. Lastly, the floor
function is applied to return the greatest integer less than or equal to the computed value.
The exemplar formula is not rigorously proven by induction or through formal methods.
Instead, it is empirically shown to produce values for the exemplar number which result
in accurate predictions, without growing too large, which would reduce prediction speed.
It is certainly possible that there exists a different formula that provides a better balance
between prediction accuracy and prediction speed. Additionally, the constant of 2 in the
formula could be a configurable input parameter to the algorithm to give the user a means
to influence this behavior based on their requirements.
To provide some intuition for the values produced by this formula, Table 2 shows
some sample “exemplar numbers” calculated for different configurations.
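A direct transcription of Equation (3) is small; the sketch below (illustrative, not the Tribuo Hdbscan source) shows the computation and prints values for a few configurations of the kind used elsewhere in this work.

```java
/** Illustrative transcription of Equation (3): exemplar number = floor(sqrt(l / 2) + c). */
public class ExemplarNumberSketch {

    static int exemplarNumber(int datasetSize, int numClusters) {
        return (int) Math.floor(Math.sqrt(datasetSize / 2.0) + numClusters);
    }

    public static void main(String[] args) {
        // Example configurations: dataset size l and identified cluster count c.
        int[][] configs = { {2000, 4}, {5000, 3}, {50000, 6} };
        for (int[] config : configs) {
            System.out.printf("l = %d, c = %d -> exemplar number = %d%n",
                    config[0], config[1], exemplarNumber(config[0], config[1]));
        }
    }
}
```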
With the total number of exemplars established, the map of cluster assignments is iterated. The number of exemplar points drawn from each cluster is proportional to the cluster’s share of the dataset; a cluster with more assigned points will have more exemplars drawn from it. Out of all the points assigned to a cluster, the points selected as exemplars are those with the lowest outlier scores, with one exception: the cluster of noise points. Recall that any outlier points are assigned to the noise cluster label. Points selected as exemplars from the noise points are those with the highest outlier scores. This process determines a
set of cluster exemplars which are used for predicting the cluster assignments and outlier
scores for new data points. This novel technique, which uses carefully computed cluster exemplars to estimate cluster assignments, has an asymptotic complexity of O(√n). To make a prediction for a new data point, the point is compared with each cluster exemplar, and there are approximately √n exemplar points.
This prediction functionality is useful to approximate how a point would fit into
an existing HDBSCAN* cluster model. New points do not change or contribute to the
existing cluster model. A Tribuo Hdbscan clustering model used for predictions should be
regularly retrained using up-to-date data to ensure the model provides the most accurate
clustering representation.
to compute the core distances has a time complexity of O(n²). The second step in the algorithm is to compute the extended minimum spanning tree graph. This also takes O(n²). These findings are consistent with the descriptions the authors of [8] provide in
their detailed complexity analysis of the HDBSCAN* algorithm. Rather than repeat all the
details from this complexity analysis here, a more practical approach to understanding
the performance of the algorithm is taken. The execution times of each major step of the
algorithm are collected when training models using a variety of datasets. These are not
meant to be rigorous experiments. Only a general idea of where the algorithm spends most
of its time is needed at this point. In fact, the first step in the algorithm which computes
the core distances consistently takes the longest out of all the steps. It is the next step that
computes the extended minimum spanning tree graph that takes the second longest.
Therefore, it makes sense to focus on parallelizing these two steps of the algorithm,
starting with the core distance calculations. In Java 8, there are several APIs which can
be used for implementing concurrency. An initial attempt is made to use a ForkJoinPool, which is designed for work that can be broken into smaller pieces recursively. This implementation does not perform well, taking longer in all cases than the sequential version of the code. Next, the core distance calculations are implemented using the ExecutorService API. For datasets larger than 5000 records, executing this logic
in parallel with multiple threads is always faster than the sequential implementation. For
small datasets, the overhead of starting and shutting down the ExecutorService appears
to degrade performance. For this reason, when the HdbscanTrainer is instantiated with
the variable numThreads equal to one, this condition will be detected, and the original
sequential core distance calculation will be executed. This provides a mechanism to achieve
optimal performance results for smaller datasets. The use of Java’s Parallel Streams API is
not explored here since some cumbersome steps would have been required to transform
the existing data structures into streams-compatible datatypes.
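A sketch of this arrangement is shown below. It reuses the per-point core distance calculation from the earlier sketch in this article, and its partitioning scheme and method names are illustrative rather than the actual Tribuo Hdbscan code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative parallel core distance calculation using an ExecutorService. */
public class ParallelCoreDistanceSketch {

    static double[] coreDistances(double[][] data, int k, int numThreads) throws Exception {
        double[] coreDistances = new double[data.length];
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            int chunk = (int) Math.ceil(data.length / (double) numThreads);
            for (int t = 0; t < numThreads; t++) {
                final int start = t * chunk;
                final int end = Math.min(start + chunk, data.length);
                // Each task fills a disjoint slice of the result array, so no locking is needed.
                tasks.add(() -> {
                    for (int i = start; i < end; i++) {
                        coreDistances[i] = MutualReachabilitySketch.coreDistance(data, i, k);
                    }
                    return null;
                });
            }
            for (Future<Void> future : executor.invokeAll(tasks)) {
                future.get(); // Propagate any exception thrown inside a task.
            }
        } finally {
            executor.shutdown();
        }
        return coreDistances;
    }
}
```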
Exact measurements comparing various core distance calculation execution times are not presented here; that analysis is too fine-grained. It is more important to
focus on measuring the time it takes to execute the complete algorithm, since that is how
the code will be used in practice. Such measurements will be presented and discussed in
the following section.
The next step of this task is to parallelize the block of logic that computes the extended
minimum spanning tree graph. Similar to the core distance calculation, there is a loop nested within a loop, which causes the time complexity to be O(n²). Carefully reviewing this
logic, again with the intention of executing it in parallel, reveals that each execution of the
outer loop cannot be safely performed in separate threads. Each subsequent execution of the
outer loop depends on a result from the previous executions of the loop. This indicates each
iteration of the outer loop needs to be made sequentially. Instead, an implementation which
again uses the ExecutorService API is developed which submits only the code within the
inner loop to a thread pool executor. This solution still requires some synchronization on a specific block of code to maintain the safety of the logic. Unfortunately, this multithreaded implementation of the logic that computes the extended minimum spanning tree graph does not perform well with any of the datasets used for testing. At this time, this step of the
HDBSCAN* algorithm will not be parallelized.
3.5.2. Predictions
Next, a comparison is made using the predictions made by Tribuo Hdbscan and
the Python module hdbscan. Recall that the reference implementation does not provide
prediction functionality. A model is trained for both implementations using the same
training dataset, and then each model is used to make predictions on a separate test dataset.
This process is repeated with three different datasets. All three datasets use Gaussians because artificial datasets such as these provide each point’s assigned cluster, which is useful for evaluating the quality of the predictions. Note that each of the three datasets is split
into separate training and test files in advance, to avoid any subtle differences that may
occur in the way each library splits the data after it has been loaded. The first dataset
uses Gaussians with four centroids and 2000 points, where each point has three features.
In this case, the data are split with 99% for training and only 1% for test. Note that for density-based clustering algorithms such as HDBSCAN*, splitting the data like this does not result in overfitting. In fact, this should improve the quality of the predictions. The
second dataset uses Gaussians with three centroids and 5000 points, where each point has
four features. The third dataset uses Gaussians with five centroids and 5000 points, where
each point has four features. For both the second and third datasets, the data are split with
80% for training and 20% for test.
3.5.3. Performance
Additionally, the performance of the Tribuo Hdbscan implementation is compared
to the other HDBSCAN* implementations reviewed in this work. The first aspect of
performance to compare is the model training times. The approach is simple: a model is trained for each implementation using the same dataset, and the training times are
recorded. Each test is performed three times and the average training times are calculated.
This process is repeated with three different datasets. The first dataset is the same credit
card usage data already described above. The second dataset uses Gaussians with six
centroids and 50,000 points, where each point has seven features. The third dataset uses
Gaussians with six centroids and 100,000 points, where each point has seven features.
The second aspect of performance to compare is the model prediction times, thus
Tribuo Hdbscan is compared to the Python module hdbscan [4]. A model is trained for
both implementations using the same training dataset. Then, each model is used to make
predictions on a separate test dataset and the prediction times are recorded. Again, each test is performed three times and the average prediction times are calculated. This process is
repeated with two different datasets. Each dataset is split into separate training and test
files in advance, to avoid any subtle differences that may occur in the way each library splits
the data after it has been loaded. The first dataset uses Gaussians with six centroids and
50,000 points, where each point has seven features. The second dataset uses Gaussians with
six centroids and 100,000 points, where each point has seven features. For both datasets, the
data are split with 60% for training and 40% for predictions. For accurate predictions, the
data should not be split in this way. The best practice would be to allocate a much higher
percentage for the training set. For these tests however, it is more interesting to provide a
larger volume of data to the prediction call.
Table 3. The adjusted mutual information scores of the cluster assignments for trained models.
When comparing the outlier scores of Tribuo Hdbscan with those of the Python module hdbscan, the results were quite different for every dataset. Because these are decimal values between zero and one, some investigation was made to determine whether the differences were merely very small fractional values. Even with a difference tolerance of 0.01, the results are still significantly different.
4.2. Predictions
Tribuo Hdbscan and the Python module hdbscan produce identical cluster assignment
predictions for each of the three datasets. This confirms that the novel technique to make
fast predictions provided with Tribuo Hdbscan produces high-quality results. Table 4
shows the adjusted mutual information scores for the predicted cluster assignments of
these two clustering implementations for the described datasets. Recall again that the
reference implementation does not provide prediction functionality.
It is interesting to note that both implementations make the identical prediction errors
for the dataset of Gaussians with four centroids and 2000 points. The adjusted mutual
information score for the predicted cluster assignments for this test case is 0.87. Both
models identify three clusters and assign several points as outliers or noise, even though
this synthetic dataset is defined to have four clusters. Recall that a similar observation is made above when reviewing the cluster assignments from the training step on this dataset.
Fortunately, this particular dataset was defined with only three features, so it is possible to
plot and visualize. The resulting visualization does show that two of the clusters appear
nearly combined, which explains this result.
Table 4. The adjusted mutual information scores for the predicted cluster assignments.
Lastly, since the cluster models trained by these two HDBSCAN* implementations
demonstrated significantly different outlier scores, the predicted outlier scores are
not compared.
4.3. Performance
The results from comparing the model training times are best visualized. Figure 6
shows a clustered column chart comparing the model training times for the three datasets.
Tribuo Hdbscan performs better than the reference implementation. Even though there is
some overhead from the Tribuo framework, and some extra steps to compute the cluster
exemplars, the improved training times are a direct result of parallelizing the core distance calculations. On the test hardware, instantiating the HdbscanTrainer class with the
variable numThreads set to eight produced the best results, as can be seen in the relevant
notebooks [16].
Figure 6. A clustered column chart comparing the model training times for the three datasets.
The Python module hdbscan is faster than Tribuo Hdbscan by an order of magnitude
in the time taken to train the models. This implementation takes advantage of scikit-learn’s
KDTree and BallTree nearest neighbor algorithms to compute the core distances. It also
uses numpy extensively and leverages its optimized numerical array operations.
The results from comparing the model prediction times are also best visualized.
Figure 7 shows a clustered column chart comparing the model prediction times for the
two datasets.
Figure 7. A clustered column chart comparing the model prediction times for the two datasets.
The novel prediction technique provided with Tribuo Hdbscan is able to make predictions much faster than the Python module hdbscan. This technique, which uses computed cluster exemplars to estimate cluster assignments, has an asymptotic complexity of O(√n), as discussed above. The Python module hdbscan closely follows the HDBSCAN* algorithm to make predictions, which has a theoretical complexity of O(n²). In practice, the hdbscan
module leverages pre-computed structures stored in memory, so it should be somewhat
faster, but it is still slower than the technique proposed in this work.
of Tribuo Hdbscan shows that the first step of the algorithm which computes the core
distances, although improved, still takes about 35% of the total processing time. It is the
next step in the algorithm which computes the extended minimum spanning tree graph that
takes the largest percentage of the processing time, at 45%. Carefully examining the Python
module hdbscan shows that there are some advances made in these steps of the algorithm.
To compute the core distances, it leverages either the KDTree or BallTree [14] classes which
are optimized nearest neighbor algorithms. To compute the minimum spanning tree, a
technique described as a Dual tree Boruvka minimum spanning tree computation can be
employed. The authors of both [19,20] describe this technique in their work. Researching
these areas and reimplementing the first two steps of the HDBSCAN* algorithm in Tribuo
Hdbscan should significantly improve the performance of model training.
Author Contributions: Conceptualization, G.S. and M.A.-K.; methodology, G.S. and M.A.-K.; software, G.S.; validation, G.S. and M.A.-K.; formal analysis, G.S.; investigation, G.S. and M.A.-K.; resources, G.S. and M.A.-K.; data curation, G.S.; writing—original draft preparation, G.S.; writing—review and editing, G.S. and M.A.-K.; visualization, G.S. and M.A.-K.; supervision, M.A.-K.; project administration, G.S. and M.A.-K.; funding acquisition, G.S. and M.A.-K. All authors have read and agreed to the published version of the manuscript.
Funding: This work is partially funded by Lewis University.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: A publicly available dataset was analyzed in this study. These data can
be found here: [Link] accessed on 10 September 2021.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Al-sharoa, E.; Al-khassaweneh, M.A.; Aviyente, S. Detecting and tracking community structure in temporal networks: A low-rank + sparse estimation based evolutionary clustering approach. IEEE Trans. Signal Inf. Process. Over Netw. 2019, 5, 723–738.
[CrossRef]
2. Al-Sharoa, E.; Al-khassaweneh, M.; Aviyente, S. Low-rank estimation based evolutionary clustering for community detection in
temporal networks. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5381–5385.
3. Campello, R.J.; Moulavi, D.; Sander, J. Density-based Clustering Based on Hierarchical Density Estimates. In Advances in
Knowledge Discovery and Data Mining, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining; Pei, J.,
Tseng, V.S., Cao, L., Motoda, H., Xu, G., Eds.; Springer: Berlin, Germany, 2013; pp. 160–172.
4. McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [CrossRef]
5. Machine Learning in Java—Tribuo. Available online: [Link] (accessed on 24 July 2021).
6. HDBSCAN* Clustering Tutorial. Available online: [Link]
(accessed on 22 January 2022).
7. Ester, M.; Kriegel, H.; Sander, J. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2 August
1996; pp. 226–231.
8. Campello, R.J.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier
Detection. ACM Trans. Knowl. Discov. Data (TKDD) 2015, 10, 5–56. [CrossRef]
9. Ankerst, M.; Breunig, M.; Kriegel, H.; Sander, J. OPTICS: Ordering Points To Identify the Clustering Structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; ACM: New York, NY, USA, 1999; pp. 49–60.
10. Maimon, O.; Rokach, L. Clustering methods. In Data Mining and Knowledge Discovery Handbook, 2nd ed.; Springer:
Berlin/Heidelberg, Germany, 2006; pp. 321–352.
11. Dinh, D.T.; Fujinami, T.; Huynh, V.N. Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette
Coefficient. In KSS 2019: Knowledge and Systems Sciences; Chen, J., Huynh, V., Nguyen, G.N., Tang, X., Eds.; Communications in
Computer and Information Science; Springer: Singapore, 2019; Volume 1103.
12. McInnes, L.; Healy, J. Accelerated Hierarchical Density Based Clustering. In Proceedings of the 2017 IEEE International Conference
on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; pp. 33–42.
13. JavaVsPythonMLlibs. Available online: [Link] (accessed on 15
September 2021).
14. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
15. Sia, W.; Lazarescu, M. Clustering Large Dynamic Datasets Using Exemplar Points. In Machine Learning and Data Mining in
Pattern Recognition, Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition, Leipzig,
Germany, 18–20 July 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 163–173.
16. TribuoHdbscan. Available online: [Link] (accessed on 10 November 2021).
17. IJava. Available online: [Link] (accessed on 22 September 2021).
18. Credit Card Dataset for Clustering. Available online: [Link] (accessed on 10
September 2021).
19. Curtin, R.; March, W.; Ram, P.; Anderson, D.; Gray, A.; Isbell, C. Tree-Independent Dual-Tree Algorithms. In Proceedings of the
30th International Conference on Machine Learning, Atlanta, GA, USA, 17 June 2013; pp. 1435–1443.
20. March, W.; Ram, P.; Gray, A. Fast Euclidean Minimum Spanning tree: Algorithm, Analysis, and Applications. In ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, Proceedings of the 16th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 July 2010; Association for Computing Machinery: New York,
NY, USA, 2010; pp. 603–612.