An Efficient and Scalable Search Engine For Models
[Link]
Received: 22 February 2021 / Revised: 1 October 2021 / Accepted: 22 November 2021 / Published online: 27 December 2021
© The Author(s) 2021
Abstract
Search engines extract data from relevant sources and make them available to users via queries. A search engine typically
crawls the web to gather data, analyses and indexes it and provides some query mechanism to obtain ranked results. There
exist search engines for websites, images, code, etc., but the specific properties required to build a search engine for models
have not been explored much. In previous work, we presented MAR, a search engine for models which has been designed
to support a query-by-example mechanism with fast response times and improved precision over simple text search engines.
The goal of MAR is to assist developers in the task of finding relevant models. In this paper, we report new developments
of MAR which are aimed at making it a useful and stable resource for the community. We present the crawling and analysis
architecture with which we have processed about 600,000 models. The indexing process is now incremental and a new index
for keyword-based search has been added. We have also added a web user interface intended to facilitate writing queries and
exploring the results. Finally, we have evaluated the indexing times, the response time and search precision using different
configurations. MAR has currently indexed over 500,000 valid models of different kinds, including Ecore meta-models,
BPMN diagrams, UML models and Petri nets. MAR is available at [Link]
relevance with respect to the query. What is more, when results come from different sources only manual comparison is possible. Altogether, searching for models is still an open problem in the MDE community.

In this paper, we present the evolution of MAR, a search engine specifically designed for models. MAR consists of several components, namely crawlers, model analysers, an indexer, a REST API and a web user interface. The crawlers and the analysers allow MAR to obtain models from known sources and prepare them to be consumed by the rest of the components. The indexer is in charge of organizing the models to perform fast and precise searches. At the user level, MAR provides two search modes: keyword-based and by-example. Keyword-based search provides a convenient way to quickly locate relevant models, but it does not take into account the structure of models. In the query-by-example mode, the user creates a model fragment using a regular modelling tool or a dedicated user interface (e.g. a textual editor) and the system compares such a fragment against the crawled models. This allows the search process to take into account the structure of the models. To perform this task efficiently, MAR encodes models by computing paths of maximum, configurable length n between objects and attribute values of the model. These paths are stored in an inverted index [7] (a map from items to the documents containing such items). Thus, given a query, the system extracts the paths of the query and accesses the inverted index to perform the scoring of the relevant models. We have evaluated the precision of MAR for finding relevant models by automatically deriving queries using mutation operators, and by systematically evaluating the configuration parameters in order to find a good trade-off between precision and the number of considered paths. Moreover, we evaluate its performance for both indexing and searching, obtaining results that show its practical applicability. Finally, MAR has currently indexed more than 500,000 models from several sources, including Ecore meta-models, UML models and BPMN models. MAR is freely accessible at [Link]

This work is based on [40], which has been extended as follows:

– An improved crawling and analysis architecture which allows us to grow the number of visited models up to 600,000.
– A new indexing process which is faster and is now incremental. An indexer for keyword-based search has also been included.
– The UI and the REST APIs have been greatly improved, which are hopefully a better resource for the community.
– An extended evaluation, which also includes the indexing process and an analysis of the effect of the configuration parameters on the search precision and the response time.

Organization. Section 2 motivates the need for this work, presents an overview of our approach and the running example, and introduces the user interface of MAR. Section 3 presents the crawling and analysis architecture. Sections 4 and 5 explain keyword-based and example-based search, respectively. Section 6 discusses the indexing process. Section 7 reports the results of the evaluation. Finally, Sect. 8 discusses the related work and Sect. 9 concludes.

2 Motivation and overview

In this section, we motivate the need for model-specific search engines through a running example. From this, a set of challenges that should be tackled is derived. Then, we present an overview of our approach.

2.1 Motivation

As a running example, let us consider a scenario in which a developer is interested in creating a DSL which reflects the form of state machines, and thus it will include concepts like state machine, state, transition, etc. The developer is interested in finding Ecore meta-models which can be useful either for direct reuse, for opportunistic reuse [33] (copy–paste reuse) or just for learning and inspiration purposes.

There are several places in which the developer could try to find meta-models. For instance, the ATL meta-model zoo contains about 300 meta-models,2 but it is only possible to look up models using the text search facility of the web browser. Using this search mechanism the user could find up to five meta-models which seem to represent state machines. This is shown in the upper image of Fig. 1. However, the only way to actually decide which one is closest to what the developer had in mind is to open the meta-models in Eclipse and inspect them.

It is also possible to look up meta-models in GitHub. In this case, a possible approach to find relevant models is by constructing a specific search query. For instance, to search for Ecore meta-models about state machines one would write a query like: EPackage state machine extension:ecore. This query will be matched against .ecore files which contain the keyword EPackage as a hint to find Ecore/XMI formatted files, plus the state machine keywords. This is shown in the lower image of Fig. 1. Matched files are not validated (i.e. files may contain a bad format), duplicates are likely to exist, and the user can only see the XMI representation of the file. In addition, the number of results can be relatively large, which exacerbates the problems mentioned before, since the user needs to manually process the models.

2 [Link] title=Ecore.
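The inverted index mentioned above, a map from items to the documents that contain them, can be illustrated with a minimal sketch in plain Python. The item strings below are invented placeholders, not MAR's actual path encoding:

```python
from collections import defaultdict

# Toy "repository": each model is reduced to the items (paths, in
# MAR's case) extracted from it.  The strings are placeholders.
models = {
    "m1": ["StateMachine|name", "State|name", "StateMachine|states|State"],
    "m2": ["Factory|name", "Machine|name"],
    "m3": ["State|name", "Transition|name"],
}

# Inverted index: map from item to the set of models containing it.
index = defaultdict(set)
for model_id, items in models.items():
    for item in items:
        index[item].add(model_id)

def candidates(query_items):
    """Collect the models that share at least one item with the query."""
    found = set()
    for item in query_items:
        found |= index.get(item, set())
    return found

print(sorted(candidates(["State|name"])))  # ['m1', 'm3']
```

Given a query, only the models returned by the index lookup need to be scored, which is what makes the approach fast for large repositories.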
Fig. 1 Search approaches for the AtlanMod meta-model zoo (upper part, browser search) and GitHub (lower part, query-string search)
Therefore, there are a number of shortcomings that developers face when trying to discover relevant models, namely:

– Heterogeneous model sources. Models may be available in several locations, which forces the user to remember all these sources and manually check each one in turn. In contrast, a search engine provides a single entry point to discover models regardless of their locations.
– Model-specific queries. The unavailability of query mechanisms that target models specifically means that the user needs to adapt the queries to the underlying platform and to give special interpretation to the results.
– Ranking results. Results are typically not sorted using model-specific similarity criteria. In the best cases, keywords are used to drive the search. In some cases, there is no sorting at all (e.g. in the AtlanMod Zoo).
– Result inspection. The user would be interested in obtaining information about the model to make a decision about whether it is relevant or not. This information could be specific renderings of the model (or parts thereof), a textual description or simple tags describing the topics of the model.
– Model validation. When searching in source code repositories, there is no guarantee that the obtained models are valid, which means that the user needs to perform an additional step of validating the models before using them.

In this setting, the lack of a modelling platform able to address these issues has been the main driving force in the creation of MAR. Its design, which is described next, is intended to address these shortcomings.

2.2 Overview

The design of a search engine for models must consider several dimensions [43]. One dimension is whether the search is meta-model based, in the sense that only models conforming to the meta-model of interest are retrieved. This is useful when the user wants to focus on a specific type of models and avoid irrelevant results. However, sometimes the user might be interested in exploring all types of models as a first approximation to the search. Another dimension is whether the search engine is generic or specific to a modelling language. There exist search engines for specific modelling languages (e.g. UML [31] and WebML [15]), but the diversity of modelling approaches and the emergence of DSLs suggest that search engines should be generic. Approaches which rely on exact matching are not adequate in this scenario since the user is typically interested in obtaining results which are approximately similar to the query. Thus, the nature of the search process requires an algorithm to perform inexact matching, taking into account the similarity of the query with respect to each result.

A search engine requires an indexing mechanism for pro-
– Model-specific queries. The queries created by the user target the contents of the models indexed by the search engine. There are two query mechanisms available.
  – Keyword-based queries. In this search mode, the user interacts with the system by typing a few keywords and obtaining models whose contents match them (see Sect. 4). This is useful for quick searches, but the results are less precise.
  – Query-by-example. In this search mode, the user interacts with the system by creating an example model (see Sect. 5). This example model contains elements that describe the expected results. The system receives the model to drive the search and to present a ranked list of results.
– Generic search. The engine is generic in the sense that it can index and search models when their meta-model is known. In the case of the query-by-example mode, the search is meta-model based because it takes into account the structure given by the meta-model.
– Crawling and indexing. An inverted index is populated with models crawled from existing software repositories (such as GitHub and GenMyModel) and datasets (such as [20]). We use Lucene as our index for keyword-based search (Sect. 4) and the HBase database as our backend for example-based queries (Sect. 6).
– Ranking procedure. The results are obtained by accessing the index to collect the relevant models among those available. Each model has a similarity score which is used to sort the results. The IDE or the web frontend is responsible for presenting the results in a user-friendly manner.
– Frontend. The functionality is exposed through a REST API which can be used directly or through some tool exposing its capabilities in a user-friendly manner. In par-

The availability of a platform of this type would allow modellers to discover relevant models for their tasks. We foresee scenarios like the following:

– Model reuse. Users who want to find models useful for their projects may use the query-by-example facility to provide a description of the structure that relevant models need to have in order to obtain them.
– Novice modellers may be interested in navigating models of a certain type to learn from other, possibly more experienced, modellers.
– Researchers who want to easily gather models (with metadata attached) in order to perform experiments (e.g. machine learning, analytics, etc.).
– Large enterprises hosting several model repositories could leverage a search engine to make their stored models easily browsable (i.e. having a local installation of MAR).

2.3 Using MAR in practice

This section provides a summary of the usage of MAR, intended to give the reader a concrete view of how the features outlined in the previous section are materialized in practice. The goal is to facilitate the understanding of the subsequent sections.

Following the running example about searching for models related to state machines, we consider two different, but complementary, use cases.

– Keyword-based search. In this use case, the user wants to skim available models for state machines by quickly typing related keywords. For instance, in Fig. 3 (upper part) the user writes state machine as keywords. This
type of query is more likely to return irrelevant results (e.g. the keyword state machine alone may also return models related to factories which use machines). However, such queries are easy to craft and to refine. An additional advantage is that a keyword-based query can naturally be used to perform queries across model types because it is possible to build an efficient common index for all models. This can be useful to explore for which modelling formalisms there are interesting models available. For instance, a developer might be interested in models representing ATMs and she might discover useful UML state machines and activity diagrams, BPMN models and even Petri nets that model this domain. She may want to reuse, e.g. the UML models, but use the BPMN models and Petri nets as inspiration to complete and refine the UML models.
– Query-by-example. In this use case, the user already has in mind the model elements and wants the results to be relatively close to a proposed example. Thus, in our system the user would create a query by crafting an example model which represents the main characteristics of the results that the user expects. Fig. 3 (bottom part) shows a model fragment in which the user indicates that they are interested in state machines in which Initial and Final states are modelled as classes. It is worth noting that this imposes a requirement on the obtained results. This means that, for instance, models in which the initial and final states are modelled using an enumeration will have a low score.

Fig. 3 Keyword-based search (quick overview) and query-by-example (detailed search)

To illustrate the usage of MAR, Fig. 4 shows the web-based user interface as it is used to perform a query-by-example search.3 In this case, the user selects that they are interested in crafting an example (1) and chooses the type of the model, Ecore in this case (2). Then, a model fragment or example that serves as the query must be provided. This can be accomplished by uploading an XMI file or by using a concrete syntax. For Ecore, we support Emfatic syntax (3).

When the user clicks on the Submit! button, the search service is invoked, and it returns the list of results. Each result (5) contains the file name (which is a link to the original file), a description (if available), a set of topics (if available), statistics (e.g. the number of model elements) and quality indicators (e.g. quality smells [41]).

Many times the number of results can be large. MAR provides a facet-based search facility (4) with which the user can filter out models by several characteristics, namely the number of quality smells, elements and topics, or change the sort order.

Finally, to facilitate the inspection of models, a rendering is also available (6).

In addition to the user interface, a REST endpoint is also available. This can be useful to empower modelling tools. We foresee scenarios in which the search results are used to provide recommendations in modelling editors (similar to the approach proposed in [53]). For instance, when the user is editing a model, MAR could be called in the background and the results can be used to recommend new elements to be added to the model, model fragments that could be copy-pasted, etc. We also envision usages of the REST API to take advantage of the query facilities in order to implement approaches to recover the architecture of MDE projects [24]. For instance, given an artefact which is not explicitly typed (i.e. its meta-models are not known for some reason), its actual meta-model could be discovered by performing queries with inferred versions of its meta-model.

3 Crawling and analysis

This section describes how MAR extracts models from available resources, and how they are analysed. From a practical point of view, this is a key component of a search engine, since it is in charge of obtaining and feeding the data that will be indexed and searched. Moreover, it poses a number of technical challenges which are also discussed in this section.

Figure 5 shows the architecture that we have built. It has three main parts: crawlers, which collect models; analysers, which take these models and validate them, compute statistics and extract additional metadata; and indexers, which are in charge of organizing the models so that they can be searched in a fast manner. The goal of this architecture is to automate the crawling and analysis of models as much as possible and to have components that can be run and evolved independently.

3 The interface for keyword-based search is similar. It is accessed by clicking on the Keywords button.
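As a sketch of how a tool might call such a REST endpoint, the snippet below builds a keyword-search URL. The base URL, endpoint path and parameter names are hypothetical assumptions made for illustration; the actual MAR API may differ:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical client-side helper.  BASE_URL, the "search" path and the
# "q"/"type" parameter names are assumptions for illustration only.
BASE_URL = "https://example.org/mar/api/"

def build_search_url(keywords, model_type=None):
    """Build the URL of a keyword-based search request."""
    params = {"q": keywords}
    if model_type is not None:
        params["type"] = model_type  # e.g. restrict results to 'ecore'
    return urljoin(BASE_URL, "search") + "?" + urlencode(params)

url = build_search_url("state machine", model_type="ecore")
print(url)  # https://example.org/mar/api/search?q=state+machine&type=ecore
```

A real client would then issue an HTTP GET against this URL and parse the returned result list, e.g. to feed recommendations into a modelling editor.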
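The crawler–analyser–indexer hand-off described above can be sketched as a staged pipeline. This is a toy illustration: the payloads, status names and in-memory hand-off are assumptions; the real components exchange crawled files through shared storage and can run independently:

```python
# Minimal sketch of the crawl -> analyse -> index hand-off.  Each stage
# consumes only the output of the previous one, so the stages can be
# run and evolved independently.

def crawl():
    # A crawler produces raw files (here, fake XMI payloads).
    return [{"id": "m1", "content": "<ecore:EPackage .../>"},
            {"id": "m2", "content": "<broken"}]

def analyse(raw_files):
    # An analyser validates the files and attaches metadata.
    analysed = []
    for f in raw_files:
        status = "valid" if f["content"].endswith("/>") else "failed"
        analysed.append({**f, "status": status})
    return analysed

def index(analysed):
    # An indexer keeps only the models marked as valid.
    return [m["id"] for m in analysed if m["status"] == "valid"]

indexed = index(analyse(crawl()))
print(indexed)  # ['m1']
```

Because the only coupling between the stages is the data they exchange, a new crawler or analyser can be added without touching the rest of the pipeline, which mirrors the design goal stated above.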
3.1 Model sources

The first challenge is the identification of model sources. Models can be found in different locations, like source code repositories (e.g. GitHub, GitLab), dedicated model repositories (e.g. AtlanMod Zoo) or private repositories (e.g. within an enterprise). There is no standard mechanism for discovering these sources, and therefore this step is manual. Moreover, many times there is an additional step of identification of relevant model types along with the required technology to process them. For instance, to process Archimate models we needed to gather its meta-model4 and integrate it appropriately in our system. We have included in MAR those sources and types of models that we were already aware of, or that we have learned about through our interaction with the MDE community. We plan to include new sources and new types of models as we discover them. An important trait is that our search engine embeds all the knowledge that we have gathered about model sources and how to process them, so that it provides an effective means for reusing this knowledge by simply searching for models.

3.2 Crawling model sources

A search engine is more useful when it is able to provide access to a greater amount of resources. Crawling is the process of gathering models from sources such as model repositories, source code repositories and websites. Hence, our crawlers are a key element of MAR. An important challenge is that each repository or source of models has its own API or protocol for accessing its resources, or even no specific API at all (e.g. models listed in plain HTML pages). This means that there is typically little reusability when imple-

4 [Link]
For GitHub, we use PyGithub6 to interact with the GitHub Search API. The GitHub API imposes two main limits which require special treatment. First, there is a maximum of 15,000 API requests per hour (for OAuth users); moreover, it is recommended to pause for a few seconds between searches. Secondly, it returns up to 1024 results per search. This means that we need to add special filtering options to make sure that each search call always returns fewer than 1024 models. Our

– Compute the MD5 hash of the file contents and check whether it is duplicated with respect to an already analysed model. If so, mark its status as duplicated.
– Check if the model can be loaded (e.g. using EMF) and check its validity (e.g. using EMF validators). If the model is valid, mark it as such so that it can be later used in the indexing process.
– Compute statistics about the model. The most basic measure is the number of model elements, but we also compute other specific metrics for some model types like Ecore, UML, etc.

5 [Link]
6 [Link]
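The duplicate-detection step from the list above can be sketched with Python's hashlib. The status strings are illustrative, not MAR's actual schema:

```python
import hashlib

# Sketch of the duplicate-detection step: a model file is marked as
# duplicated when the MD5 hash of its contents has been seen before.
seen_hashes = set()

def dedup_status(content: bytes) -> str:
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_hashes:
        return "duplicated"
    seen_hashes.add(digest)
    return "new"

print(dedup_status(b"<ecore:EPackage name='sm'/>"))  # new
print(dedup_status(b"<ecore:EPackage name='sm'/>"))  # duplicated
```

Hashing the raw bytes means only exact copies are detected; semantically equivalent models serialized differently would still be treated as distinct.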
Table 1 Summary of the types of models crawled by MAR as of September 30, 2022

Type       Source         Crawled   Duplicates   Failed   Indexed   Observations
Ecore      GitHub          67,322       46,199      341    20,782   Based on the standard Ecore meta-model
           GenMyModel       3,987            3       27     3,957
           AtlanMod           304            1        4       299
UML        GitHub          53,082        7,282    1,699    44,101   Based on the Eclipse UML implementation
           GenMyModel     352,216          143   23,836   328,237
BPMN       GenMyModel      21,285            0      200    21,085   Based on the EMF BPMN2 meta-model
Archimate  GitHub             496           77      106       313   Based on the Archi meta-model
PNML       GitHub           3,291        1,576    1,044       671   Based on the PNML framework
Sculptor   GitHub             188           88        0        88   An application generator based on Xtext
RDS        GenMyModel      91,411          108      515    90,788   Refers to entity/relationship diagrams; meta-model semi-automatically inferred
Simulink   Dataset [20]       200            0        0       200   Converted to EMF using Massif [4]
Total      –              593,582       55,477   27,972   510,321   –
MAR provides a keyword-based search mode in which the user just types a set of keywords and the system returns models that contain these input keywords. To implement this functionality, we rely on existing techniques related to full-text search, which have been widely studied in the literature [59]. In our context, we assume that the string attributes of the models that we want to expose in the search engine act as keywords. Thus, we consider each model to be a document consisting of a list of keywords. The expectation is that a model contains names which reflect its intention, and the user can reach relevant models by devising keywords similar to those appearing in the models.

Implementation-wise, we use Apache Lucene [2] to implement this functionality, since it provides facilities for pre-processing documents, indexing them and performing the keyword-based search. To integrate the crawled models with Lucene, each model goes through the following pipeline:

1. Term extraction. We extract all the string attributes (e.g. [Link] in Ecore) from the model. For instance, the text document for the model shown in Fig. 3 would be:

   StateMachine states State name InitialState FinalState

Finally, this set of pairs is stored in the Lucene inverted index.

Moreover, in each document we also include metadata about the corresponding model obtained in the crawling process, such as the description and topics. Then, the documents are indexed in an inverted index managed by Lucene.

Regarding the search process, given a set of keywords as a query, Lucene accesses the inverted index and ranks the models according to their scores with respect to the query (it uses the Okapi BM25 [51,59] scoring function).

A feature of our keyword-based index is that all models are indexed together. This means that a query like state machine may return models of different types: Ecore meta-models representing state machines, but also, e.g., UML models in which the keywords appear. The user can then use the UI filters to select the concrete type of models.

The main disadvantage of keyword-based queries is that the structure of the models is not taken into account. Thus, a query like state transition could not distinguish whether the user is interested in models containing classes named State or Transition, or whether they refer to references like states or transitions. Thus, one way to improve the precision of the search is to provide model fragments as queries so that the structure of the models can be taken into account.

vocabulary connects two classes.
– L_AR corresponds to the set of names of the different attributes. An edge whose label belongs to this vocabulary connects a class with an attribute value.
The procedure followed to transform a model into a multigraph is shown in Algorithm 1. This procedure receives as input a model, its meta-model and the sets of all meta-classes (C), references (R) and attributes (A) that will be taken into account (i.e. those model elements whose meta-types are not in these sets are filtered out). The output is a directed multigraph whose label languages L_R, L_AR and L_C are determined by R, A and C, respectively. In the output multigraph we can distinguish between two types of nodes: nodes labelled with class in μ1 (we call them object class nodes) and nodes labelled with attribute in μ1 (we call them object attribute nodes). For instance, the model in Fig. 3 is transformed into the graph shown in Fig. 8 using Algorithm 1. In this example, we consider as input for the algorithm:

R = {all references}
A = {[Link], [Link], [Link]}
C = {ENamedElement}

Fig. 8 Multigraph obtained from the model in Fig. 3. Object class nodes are depicted as rounded rectangles, while attribute nodes are depicted as ovals

This graph is the interface of MAR with the different modelling technologies; that is, MAR can process any model that can be transformed to this graph. Up to now, we have implemented the algorithm for EMF, but we foresee other implementations for other types of models.

5.2 Path extraction

Once the graph associated with a model has been created, we perform a path extraction process. The underlying idea is to use the paths and subpaths found in the graph as the means to compare the example query against the models in the repository. In this context, the term bag of paths (BoP) refers to the multiset of paths that have been extracted from a multigraph. A path that belongs to a BoP can have one of the following forms:

– Path of null length: (μ2V(v)), or
– Path of length greater than zero: (μ2V(v_1), μ2E(e_1), ..., μ2V(v_{n−1}), μ2E(e_{n−1}), μ2V(v_n)), with n ≥ 2.

This definition of BoP establishes the form of the elements that should belong to this multiset and does not enforce which paths must be computed. Therefore, extraction criteria must be defined depending on the application domain. We follow the same criteria as our previous work [40], in which only paths of these types are considered:

– All paths of zero length for objects with no attributes. In Fig. 8, the only object without attributes is the EPackage object (because we did not establish the package name). Therefore, the only example of this type of path is (EPackage).
– All paths of unit length (paths of length one) between attribute values and their associated object class. In Fig. 8, examples of this type of path are (states −name→ EReference), (State −name→ EClass), (false −abstract→ EClass), etc.
– All simple paths with length less than or equal to a threshold (paths of length 3 or 4 are typically enough to encode the model structure, as determined empirically in Sect. 7) between attributes, and between attributes and objects without attributes. In Fig. 8, examples of this type of path are (states −name→ EReference −containment→ true), (StateMachine −name→ EClass −eStructuralFeatures→ EReference −name→ states), (EPackage −eClassifiers→ EClass −name→ InitialState), (StateMachine −name→ EClass −eStructuralFeatures→ EReference −eType→ EClass −name→ State), etc.

The underlying idea is that paths of zero and unit length encode the data of an object (its internal state). For instance, the path (State −name→ EClass) means that the model has an EClass whose name is State. On the other hand, paths of length greater than one encode the model structure. For instance, the path (StateMachine −name→ EClass −eStructuralFeatures→ EReference −name→ states) means that the query expects the existence of a class named StateMachine with a reference named states.

5.3 Scoring

Given a query q (a model) and a repository of models M, our goal is to compare each model of the repository with the query. To achieve this, all the models in M are transformed into a set of multigraphs and then paths are extracted, obtaining a set of bags of paths, BoPs = {BoP_1, ..., BoP_n}. The query q follows the same pipeline and is transformed into BoP_q. To perform the comparison between BoP_q and each bag BoP_i of BoPs, that is, to compute the score r(BoP_q, BoP_i), we will use the adapted version of the Okapi BM25 scoring function [51,59] used in [40]:

$$ r\left(BoP_q, BoP_i\right) = \sum_{\omega \in BoP_q \cap BoP_i} c_\omega(q)\,\frac{(z+1)\,c_\omega(i)}{c_\omega(i) + z\left(1 - b + b\,\frac{|BoP_i|}{avdl}\right)}\,\log\frac{n+1}{df(\omega)} \qquad (1) $$

where 1 ≤ i ≤ n, c_ω(q) is the number of times that a path ω appears in the bag BoP_q, c_ω(i) is the number of times that a path ω appears in a bag BoP_i, df(ω) is the number of bags in BoPs which contain that path, avdl is the average number of paths over all the BoPs in the repository, and z ∈ [0, +∞) and b ∈ [0, 1] are hyperparameters. (In Eq. (1), term 1 is the sum over the paths shared by both bags, term 2 is the logarithmic factor, and term 3 is the length normalization 1 − b + b|BoP_i|/avdl.) This function takes BoP_i ∈ BoPs and BoP_q, and returns a similarity score (the higher, the better). We have chosen this scoring function because it has three useful properties for our scenario:

– It takes into account the paths that are in both bags of paths (term 1 of Eq. (1)).
– It takes into account the paths that are very common in the bags of the entire repository (term 2 of Eq. (1)).
– It penalizes models which have a large size (term 3 of Eq. (1)) to prevent the effect caused by such models having significantly more matches. This is controlled by the hyperparameter b. If b → 1, larger models are penalized more. We take b = 0.75 as the default value.

It is important to remark that the hyperparameter z controls how quickly an increase in the path occurrence frequency results in path frequency saturation. The bigger z is, the slower the saturation. We take z = 0.1 as the default value.

Therefore, to perform a query, the score of each element in BoPs with respect to BoP_q is computed. Then, the repository is sorted according to the computed scores and the top k models are retrieved (the k models with the highest scores). While this is a straightforward approach, it is not efficient. Instead, an index structure is needed in order to make the scoring step scalable and efficient. The next section presents the implementation of this structure.
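As an illustration, the scoring of Eq. (1) can be sketched directly over bags of paths. The toy repository below is invented for illustration; only the formula follows the definition above:

```python
import math
from collections import Counter

def score(bop_q, bop_i, bops, z=0.1, b=0.75):
    """Adapted Okapi BM25 of Eq. (1) between a query bag of paths
    (bop_q) and a repository bag (bop_i); bops is the whole repository."""
    n = len(bops)
    avdl = sum(len(bag) for bag in bops) / n          # average bag size
    c_q, c_i = Counter(bop_q), Counter(bop_i)
    total = 0.0
    for path in c_q.keys() & c_i.keys():              # term 1: shared paths
        df = sum(1 for bag in bops if path in bag)    # bags containing path
        idf = math.log((n + 1) / df)                  # term 2: rarity
        norm = 1 - b + b * len(bop_i) / avdl          # term 3: size penalty
        total += c_q[path] * (z + 1) * c_i[path] / (c_i[path] + z * norm) * idf
    return total

# Toy bags of paths: each tuple stands for a path such as
# (State -name-> EClass).
repo = [
    [("State", "name", "EClass"), ("Transition", "name", "EClass")],
    [("Factory", "name", "EClass")],
]
query = [("State", "name", "EClass")]

ranked = sorted(range(len(repo)), key=lambda i: -score(query, repo[i], repo))
print(ranked)  # [0, 1] -- the state-machine-like model ranks first
```

Scoring every bag in the repository like this is the straightforward but inefficient approach mentioned above; the inverted index of the next section avoids iterating over bags that share no path with the query.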
In a typical text retrieval scenario, an indexer organizes the documents in an appropriate data structure for providing a fast response to the queries. In our case, which is a model retrieval scenario, the indexer organizes models instead of documents. As in every search engine, the indexing module is crucial when we look for scalability and the management of large repositories.

6.1 Inverted index

The main data structure used by the indexer is the inverted index [59]. In a text retrieval context, the inverted index consists of a large array in which each row is associated with a word in the global vocabulary and points to a list of documents that contain this word.

In our context, instead of words we have paths and instead of documents we have models. This is a challenge that we have to deal with since it causes two problems:

- Large amount of paths per query. In contrast to a text retrieval scenario, in which a query is composed of a small number of keywords, in query-by-example a small model/query can be composed of hundreds of different paths.
- Huge inverted index. In a text retrieval scenario, the vocabulary contains on the order of hundreds of thousands of words, whereas in model retrieval we have millions of different paths.

These two issues translate in practice into many accesses per query to a huge inverted index. Therefore, the design of an appropriate and more complex structure beyond the basic inverted index is needed. The main ingredients that we use to tackle this are:

- The inverted index is built over HBase [1]. This database has a distributed nature. Therefore, the table associated with the inverted index is split into chunks which may be distributed over the nodes of a cluster. This design decision makes the system scalable with respect to the size of the index.
- Special HBase schema. We have designed an HBase schema to accommodate an inverted index particularly oriented to resolving path-based queries. The underlying idea is to split the paths into two parts. Doing this, the number of accesses per query to the inverted index is reduced. This is explained next.

We use Apache HBase [1,30] to implement the inverted index. HBase is a sparse, distributed, persistent, multidimensional, sorted map database inspired by Google BigTable [19]. The HBase data model consists of tables; each table has rows identified by row keys, each row has some columns (qualifiers), and each column belongs to a column family and has an associated value. A value can have one or multiple versions identified by a timestamp. Therefore, given a table, a particular value has 4 dimensions:

(Row, Column Family, Column Qualifier, Timestamp) --> Value

Due to the design of HBase, the Column Family and Timestamp dimensions should not have a large number of distinct values (i.e. three or four distinct values). On the other hand, there is no restriction on the Row and Column Qualifier dimensions.

Regarding reading operations, HBase provides two operations: scan (iterate over all rows that satisfy a filter) and get (random access by key to a particular row). With respect to writing, the main operation is put, to insert a new value given its dimensions (row, column family and column qualifier). This operation does not overwrite if multiple versions are allowed.

Designing an HBase schema for an inverted index can be relatively straightforward: each path is a row key and the models that contain this path are columns whose qualifier is the model identifier. However, this approach suffers from one of the problems that we have explained previously: lots of accesses (get) per query. Therefore, in order to avoid this problem, we use the schema presented in [40]. Each path is split into two parts: the prefix and the rest. The prefix is used as row key whereas the rest is used as column qualifier. The split point of the path depends on whether it starts with an attribute node (the prefix is the first sub-path of unit length) or an object class node (the prefix is the first node). The value associated with the column is a serialized map which associates the identifier of each model which has this path with a tuple composed of the number of times the path appears in the model and the total number of paths of the model. This tuple is used by the system to compute the scores in an efficient manner. Therefore, this schema has the following form:

(Prefix, Column Family, Rest, Timestamp) --> Map ModelID -> (number of times, model size)

Thus, a path is now the concatenation of the prefix (row id) and the rest (column qualifier). The column family is […]
[Figure: example of the Ecore inverted index table in HBase (ROW KEY, COLUMN FAMILY, value), with prefix/rest row keys and versioned values, e.g. (state, name,) -> {id1: [1,1032], id3: [1,891], id10: [1,506], ...} (Timestamp1) and EClass -> {id5: [1,280], id6: [1,891], id23: [1,678], ...} (Timestamp2)]

Fig. 10 Possible sequence of insertion and merge operations in the incremental mode (Insertion, Insertion, Insertion, Merge, Insertion, Insertion, Insertion, Merge, ...). In this case, the number of distinct versions that the column family allows is three
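The prefix/rest schema can be simulated with in-memory dictionaries. This is only a sketch, not MAR's actual code: a real HBase client would issue put/get against the same (row key, qualifier) coordinates, and our reading of "the first sub-path of unit length" as the first (node, edge, node) triple is an assumption.

```python
import json

def split_path(path, starts_with_attribute):
    # Assumption: a "sub-path of unit length" is (node, edge, node), i.e.
    # the first three elements in this tuple encoding of a path.
    cut = 3 if starts_with_attribute else 1
    return path[:cut], path[cut:]

# In-memory stand-in for the HBase table:
# row key (prefix) -> {column qualifier (rest) -> serialized map}
table = {}

def put(path, model_id, times, model_size, starts_with_attribute=False):
    prefix, rest = split_path(path, starts_with_attribute)
    row = table.setdefault(prefix, {})
    posting = json.loads(row.get(rest, "{}"))
    posting[model_id] = [times, model_size]  # (number of times, model size)
    row[rest] = json.dumps(posting)          # serialized map, as in the schema

put(("StateMachine", "name", "EClass"), "id1", 1, 1032)
put(("StateMachine", "eStructuralFeatures", "EReference"), "id1", 2, 1032)

# Both paths start at the same object class node, so they share a row:
# a single get per prefix replaces one get per path.
row = table[("StateMachine",)]
print(len(row))  # 2 qualifiers retrieved with one row access
```

The design pay-off is that all paths sharing a prefix are fetched with one random access, which directly reduces the "lots of gets per query" problem described above.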
[Fig. 11: diagram of the indexing pipeline; steps labelled 1-5, with sub-steps 2.1 and 2.2]
using Algorithm 1 (label 2.1 in Fig. 11). After that, the path extraction process is performed, getting the set of bags of paths {BoP1, ..., BoPn} (label 2.2 in Fig. 11).
3. Emit key-value pairs. The key is the processed path and the value is a tuple (model id, statistics) (label 3). For each BoP and for each distinct path that belongs to this BoP, we emit a key-value pair of the form (p, (id, (a, b))), where p is the path, id is the identifier of the model that the BoP is extracted from, a is the number of times the path appears in the BoP and b is the total number of paths of the BoP. Here, p is the key and (id, (a, b)) is the value.
4. Group by key and merge the list of values (from the key-value pairs) into a serialized map (label 4). We group the pairs emitted in the previous step according to the path (key), and the list of values associated with this key is transformed into a map of the form {id1: (a1, b1), id2: (a2, b2), ...}. For instance, as we can observe in Fig. 11, the map associated with the path p2 has the elements 1: (a_{p2,1}, b1) and 3: (a_{p2,3}, b3), since BoP1 and BoP3 contain the path p2.
5. Split the paths and store them in the HBase table (label 5). Each path is split into a prefix and the rest according to the HBase schema presented in Sect. 6.2, and it is stored in the inverted index table with its associated map using the HBase put operation.

This step is similar in both the batch and the incremental modes. The only difference is that in incremental mode a timestamp is attached to each inserted value to allow multiple versions. This is possible because HBase allows multiple versions in column families. Therefore, when a put operation is executed and the key already exists, the last value is not removed; instead, a new version is added.

6.3.2 Merge

HBase is able to resolve read operations in the presence of versions. However, the number of versions of a column family is limited by an upper bound in order to control the size of the HFile (the distributed file that stores the column family). Therefore, it is necessary to perform a merge operation before the number of versions grows too large (e.g. in our implementation we set the maximum number of versions to four).

According to our HBase schema, a value associated with a version in a column is a serialized map of models that contain a path (the path is given by the concatenation of row key and column qualifier). Therefore, these versions can be merged into a single one (i.e. these maps can be merged into a single map). The merge subprocess is a Spark [58] script which does the following (Fig. 12):

1. Read the table associated with the inverted index and extract all tuples (prefix, rest, timestamp, map) (label 1 in Fig. 12).
2. Ignore the timestamp dimension and emit key-value pairs (label 2 in Fig. 12). Here, the key is (prefix, rest) and the value is the serialized map.
3. Group by the key, that is, the coordinates (prefix, rest). This corresponds to step 3 in Fig. 12, in which (prefix_i, rest_i) are the distinct keys and each [map] is the list of maps associated with a key.
4. Merge the list of maps associated with a key (prefix, rest) and generate a new map. This step corresponds to label 4 in Fig. 12. Here, (prefix_i, rest_i) are the distinct keys and map_i is the merged map.
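The emit, group-by and merge steps can be sketched in plain Python. The real implementation is a Spark script; this in-memory version only illustrates the data flow, and the example paths and identifiers are invented for the sketch.

```python
from collections import Counter, defaultdict

# Step 3: emit (path, (model_id, (a, b))) for each distinct path of a BoP,
# where a = occurrences of the path and b = total number of paths.
def emit_pairs(model_id, bop):
    total = sum(bop.values())
    return [(path, (model_id, (a, total))) for path, a in bop.items()]

# Step 4: group pairs by path and build the map {model_id: (a, b), ...}
# that is stored (serialized) as the column value.
def group_by_path(pairs):
    index = defaultdict(dict)
    for path, (model_id, stats) in pairs:
        index[path][model_id] = stats
    return index

# Merge step: versions of the same column hold maps for disjoint sets of
# models, so merging per (prefix, rest) coordinate is a plain dict union.
def merge_versions(maps):
    merged = {}
    for m in maps:
        merged.update(m)
    return merged

bop1 = Counter({("A", "name", "EClass"): 1, ("p2",): 1})
bop3 = Counter({("p2",): 2})
pairs = emit_pairs("id1", bop1) + emit_pairs("id3", bop3)
index = group_by_path(pairs)
print(index[("p2",)])  # {'id1': (1, 2), 'id3': (2, 2)}
```

With Spark, the same logic would be expressed with flatMap for the emit step, groupByKey for the grouping, and a map-merging reduce for the final step.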
[Figure: Ecore inverted index table (ROW KEY, COLUMN FAMILY, value)]

that others can install it locally and perform experiments with it.
Fig. 14 Non-incremental mode performance. The x-axis indicates the number of models that the process has to index and the y-axis the time in seconds. MAR scales smoothly as the size of the index increases

[Table: mutation operators used to generate queries]
1. Extract connected subset: Select a root element and pick classes reachable via references or subtype/supertype relationships, up to a given length.
2. Remove inheritance: Remove a random inheritance link (up to 20%).
3. Remove leaf classes: Remove classes, but prioritize those which are farther from the root element (up to 30%).
4. Remove references: Remove a random reference (up to 30%).
5. Remove enumeration: Remove random enumerations or literals (50%).
6. Remove attributes: Remove a random attribute (up to 30%).
7. Remove low-df classes: Remove elements whose name is "rare".
8. Rename from cluster: Replace a name by another name corresponding to an element belonging to the same meta-model cluster (up to […])

[Figure: incremental mode performance; y-axis: Time (seconds); legend (Operation): insertion, merge; x-axis: successive insertion batches of models interleaved with merge operations]
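The mutation operators above can be sketched as random edits on a toy meta-model. Everything here is hypothetical (the graph structure, the names, the edge representation); only the removal fractions follow the operators listed above.

```python
import random

# Hypothetical toy meta-model, used only to illustrate the mutation
# operators applied when generating benchmark queries.
model = {
    "classes": [f"C{i}" for i in range(10)],
    "inheritance": [("C1", "C0"), ("C2", "C0"), ("C3", "C1"),
                    ("C4", "C1"), ("C5", "C2")],
    "references": [("C0", "C1"), ("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
                   ("C4", "C5"), ("C5", "C6"), ("C6", "C7"), ("C7", "C8"),
                   ("C8", "C9"), ("C9", "C0")],
}

def remove_fraction(edges, fraction, rng):
    # Remove up to `fraction` of the links, picking one at random each time.
    edges = list(edges)
    for _ in range(int(len(edges) * fraction)):
        edges.remove(rng.choice(edges))
    return edges

rng = random.Random(0)
mutated = dict(model)
mutated["inheritance"] = remove_fraction(model["inheritance"], 0.20, rng)  # operator 2
mutated["references"] = remove_fraction(model["references"], 0.30, rng)    # operator 4
print(len(mutated["inheritance"]), len(mutated["references"]))  # 4 7
```

Each mutated copy plays the role of an imperfect query, so the evaluation measures whether the original model is still retrieved despite the edits.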
Table 6 Mean and max statistics of the query response times (in seconds)

          Small          Medium         Large
          Mean    Max    Mean    Max    Mean    Max
All-2     0.13    0.25   0.24    0.51   0.35    1.82
All-3     0.29    1.02   0.66    2.07   0.97    3.63
All-4     0.98    2.75   2.36    6.60   3.47    9.18
Names-2   0.03    0.09   0.06    0.16   0.09    1.08
Names-3   0.05    0.21   0.11    0.41   0.19    1.64
Names-4   0.08    0.41   0.28    1.41   0.66    3.85

Table 7 Mean and max statistics of the number of paths

          Small           Medium           Large
          Mean     Max    Mean     Max     Mean      Max
All-2     66.0     195    178.7    892     437.1     1578
All-3     238.2    916    789.1    4337    2001.0    8600
All-4     619.7    4997   2849.0   21282   10340.0   48803
Names-2   22.6     94     65.5     284     172.0     552
Names-3   91.6     544    366.1    2030    1059.4    5069
Names-4   284.5    3687   1679.0   13108   7724.0    37474
Fig. 16 Boxplot of the number of paths per query. The x-axis represents the configurations considered (All-2, All-3, All-4, Names-2, Names-3, Names-4), the y-axis the number of paths of a query

Fig. 17 Boxplot of the query response time. The x-axis represents the configurations considered, the y-axis the time in seconds and the colour the size of the queries (Small, Medium, Large)
are queries with up to 193 classes. This suggests that MAR could be applicable in scenarios like model clone detection.

8 Related work

In this section, we review works related to our proposal. We organize these works in five categories: model encoding techniques, model repositories, mining and analysis of model repositories, search engines and model clone detection.

8.1 Model encoding techniques

The concept of BoP, which is a key element of our query-by-example approach, is inspired by the notion of bag of path-contexts defined in Code2vec [6]. In that paper, the authors study the transformation of code snippets into continuous distributed vectors by using a neural network with an attention mechanism whose input is a multiset of paths extracted from the AST of a code snippet.

SAMOS is a platform for model analytics. It is used in [8] for clustering and in [10] for meta-model clone detection. SAMOS's approach is very close to ours since both transform models into graphs and extract some paths (the authors call them n-grams). However, there are some differences between MAR and SAMOS. First, our graph is different as we consider single attributes as independent nodes, and therefore the extracted paths are different. Second, SAMOS is not designed to perform search, since it uses similarity metrics which are not intended for fast search.

AURORA [45] extracts tokens from Ecore models and uses a neural network to classify them into domains (e.g. state machines, database, bibliography, etc.). These tokens are treated as words and the input of the neural network is a TF-IDF vector. In AURORA, paths are encoded by concatenating tokens. MEMOCNN [46] can be considered the evolution of AURORA. It uses an encoding schema similar to AURORA, but MEMOCNN transforms meta-models into images which are used to feed a convolutional neural network that performs the classification.

Clarisó and Cabot propose the use of graph kernels for clustering models [21]. The general idea is to transform models into graphs and apply machine learning methods using graph kernels [47] as the similarity measure between graph models.

8.2 Model repositories

The study performed by Di Rocco [26] shows that the search facilities provided by existing model repositories are typically keyword based, tag based, or there is no search facility at all. Our aim with MAR is to offer a search platform able to provide a search layer on top of third-party repositories from which models can be extracted.

MDEForge is a collaborative modelling platform [13] intended to foster "modelling as a service" by providing facilities like model storage, search, clustering, workspaces, etc. It is based on a megamodel to keep track of the relationships between the stored modelling artefacts. One goal of MDEForge is to remove the need of installing desktop-based tooling and to enable collaboration between users. As such, the services offered by MDEForge are focused on its registered users and their resources. In contrast, MAR is intended to offer a simpler and more open search service.

Img2UML [35] is a repository of UML models obtained by collecting images of UML diagrams through Google Images searches. These images are converted to XMI files, and there is a search system based on querying names of specific properties (e.g. class names). The use of this repository in a teaching environment has been studied in [36], showing its usefulness for improving students' designs. MAR could be used in a similar setting, using its example-based facilities to improve the search process.

GenMyModel is a cloud service which provides online editors to create different types of models and stores thousands of public models [3]. Our crawlers target GenMyModel, using the REST API it offers, in order to provide search capabilities for its models.

ReMoDD is a repository of MDE artefacts of different nature, which is available through a web-based interface [29]. It is intended to host case studies, rather than individual files. In addition, it does not include specific search facilities.

Hawk provides an architecture to index models [11], typically for private projects, and it can be queried with OCL. This means that the query result will only contain models that satisfy the OCL expression. In contrast, the main use case of MAR is inexact matching, which means that the resulting models do not necessarily match the complete query.

8.3 Mining and analysis of model repositories

There are some works devoted to crawling and analysing models. ModelMine [50] is a tool to mine models from GitHub, along with additional information such as versioning. It is similar to our approach for crawling GitHub, except that our system uses file sizes instead of dates for filtering results. Another difference is that it does not provide specific capabilities for validating the retrieved models. Similarly, Restmule is a tool to help in the creation of REST APIs to mine data from public repositories [52]. We plan to use it and compare its performance and reliability against our system.

The work by Barriga et al. proposes a tool-chain to analyse repositories of Ecore models [12]. To prove its usefulness, they analyse a repository of 2420 Ecore models. Their approach is similar to the analysis module of MAR. However,
our validators can handle models of different types (not only Ecore meta-models), they are fault-tolerant and we have used them to analyse an order of magnitude more models.

A statistical analysis is carried out on repositories of ~450 Ecore models by Di Rocco et al. [25]. In particular, the authors studied relations between structural features of the meta-models (for instance, the correlation between featureless meta-classes and the number of meta-classes with a supertype) and their distributions (for instance, the distribution of isolated classes). On the other hand, a similar analysis is carried out by the same authors in [27] in the context of ATL transformations together with the source and target meta-models.

SAMOS is used to analyse and cluster the S.P.L.O.T. repository [9]. First, they try to get a high-level overview of the repository by considering large clusters. Then, they carry out a more fine-grained clustering to discover model clones or duplicates.

Table 8 Summary of search engine for models approaches, adapted and extended from [14] [columns: Query format, Search type, Models, Index, Crawler, Megamodel aware; table body not included in this extraction]

8.4 Search engines

A number of search engines for models have been proposed. However, as far as we know, none is widely used. In Table 8, the proposed search engines are shown along with an analysis of their features.

A search engine for WebML models is presented by Bislimovska in [15]. It supports two types of queries: text queries and query-by-example. This search engine uses an inverted index, so the response is fast. On the other hand, when an example is given as a query, the similarity between models is computed by using a variation of the A* algorithm. Therefore, the query's response is slow, since an iteration over the full repository is needed to obtain the final ranked list. On the other hand, the work by Gomes [31] focuses on the retrieval of UML models using the combination of WordNet and Case-Based Reasoning.

MOOGLE [42] is a generic search engine that performs text search, and the user can specify the type of the desired model element to be returned. MOOGLE uses the Apache Lucene query syntax and Apache SOLR as the backend search engine. In [14], a similar approach to MOOGLE is proposed with some new features like megamodel awareness (consideration of relations among different kinds of artefacts).

MoScript, a DSL for querying and manipulating model repositories, is proposed in [38]. This approach is model-independent, but the user has to type complex OCL-like queries that only retrieve exact models, not a ranked list of models.

The work by Dijkman [28] investigates the application of graph matching algorithms to ranking BPMN models given an example as the query. After that, in [57], the greedy algorithm (studied in [28]) is improved, obtaining a procedure for computing similarity between BPMN graphs ten times faster. On the other hand, Cao [18] proposes to use the Hungarian algorithm to query similar Petri nets.

In the work by Kalnina [34], a tool for finding model fragments in a DSL tool is proposed. The fragments are expressed using concrete syntax and a graph matching procedure is used as the search mechanism.

Genetic algorithms with structural and syntactic metrics were used to compare two meta-models in the work by Kessentini [37]. However, this approach is not intended to be fast, since it uses genetic algorithms.

8.5 Model clone detection

It is worth including this topic in the related work section because it can be seen as a type of approximate matching. Model clone detection is a relatively new topic [10] and much research has been devoted to dealing with challenges such as scalability (models can be large graphs [22]),
tool-specific representations, internal identifiers, and abstract versus concrete syntaxes [10,54]. However, model clone detection approaches have not been applied to search in large-scale repositories, since these techniques are more focussed on pairwise model matching. In our case, we have a scoring algorithm that is global in the sense that model paths are stored together in the index. This makes our approach more scalable in terms of the size of the repository.

In [10], SAMOS is applied to meta-model clone detection. A study of UML clones is carried out in [54] and [55], where a tool to detect clones (called MQlone) is presented and a description of clone types in UML is provided.

Exas encodes a graph model into a vector to perform approximate matching [44,48]. This technique has also been applied to detect model transformation clones [56].

On the other hand, clone detectors for Simulink models have been studied in the literature. For instance, an approach for clone detection in data flow languages, based on graph theory, is presented in [23] and applied to model clone detection in Simulink. SIMONE (an extension of the text-based tool NICAD) is presented in [5] to detect clones in Simulink models.

9 Conclusions

[…] employee also returns models related to workers). Another line of work is the application of machine learning techniques that may help us automate some tasks, like generating descriptive tags, automatically classifying models in order to improve faceted search, and considering quality features in the ranking algorithm by using a learning-to-rank approach [17,32].

Acknowledgements This work was funded by a Ramón y Cajal 2017 Grant (RYC-2017-237), funded by MINECO (Spain) and co-funded by the European Social Fund. José Antonio Hernández López enjoys a FPU Grant funded by the Universidad de Murcia.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit [Link]/licenses/by/4.0/.
on Model Driven Engineering Languages and Systems: Companion 31. Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carreiro, P., Ferreira,
Proceedings, pp. 1–8 (2020) J.L., Bento, C.: Using wordnet for case-based retrieval of UML
13. Basciani, F., Di Rocco, J., Di Ruscio, D., Di Salle, A., Iovino, models. AI Commun. 17(1), 13–23 (2004)
L., Pierantonio, A.: Mdeforge: an extensible web-based modeling 32. He, C., Wang, C., Zhong, Y.X., Li, R.F.: A survey on learning to
platform. In: CloudMDE@ MoDELS, pp. 66–75 (2014) rank. In: 2008 International Conference on Machine Learning and
14. Basciani, F., Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, Cybernetics, vol. 3, pp. 1734–1739. IEEE (2008)
A.: Exploring model repositories by means of megamodel-aware 33. Holmes, R., Walker, R.J.: Systematizing pragmatic software reuse.
search operators. In: MODELS Workshops, pp. 793–798 (2018) ACM Trans. Softw. Eng. Methodol. (TOSEM) 21(4), 1–44 (2013)
15. Bislimovska, B., Bozzon, A., Brambilla, M., Fraternali, P.: Textual 34. Kalnina, E., Sostaks, A.: Towards concrete syntax based find for
and content-based search in repositories of web application models. graphical domain specific languages. In: 2019 ACM/IEEE 22nd
ACM Trans. Web (TWEB) 8(2), 1–47 (2014) International Conference on Model Driven Engineering Languages
16. Bucchiarone, A., Cabot, J., Paige, R.F., Pierantonio, A.: Grand and Systems Companion (MODELS-C), pp. 236–242. IEEE (2019)
challenges in model-driven engineering: an analysis of the state 35. Karasneh, B., Chaudron, M.R.: Online Img2UML repository: an
of the research. Softw. Syst. Model. 1–9 (2020) online repository for UML models. In: EESSMOD@ MoDELS,
17. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamil- pp. 61–66. Citeseer (2013)
ton, N., Hullender, G.: Learning to rank using gradient descent. 36. Karasneh, B., Jolak, R., Chaudron, M.R.: Using examples for teach-
In: Proceedings of the 22nd international conference on Machine ing software design: an experiment using a repository of uml class
learning, pp. 89–96 (2005) diagrams. In: 2015 Asia-Pacific Software Engineering Conference
18. Cao, B., Wang, J., Fan, J., Yin, J., Dong, T.: Querying similar pro- (APSEC), pp. 261–268. IEEE (2015)
cess models based on the Hungarian algorithm. IEEE Trans. Serv. 37. Kessentini, M., Ouni, A., Langer, P., Wimmer, M., Bechikh, S.:
Comput. 10(1), 121–135 (2016) Search-based metamodel matching with structural and syntactic
19. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., measures. J. Syst. Softw. 97, 1–14 (2014)
Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a dis- 38. Kling, W., Jouault, F., Wagelaar, D., Brambilla, M., Cabot, J.:
tributed storage system for structured data. ACM Trans. Comput. Moscript: A dsl for querying and manipulating model repositories.
Syst. (TOCS) 26(2), 1–26 (2008) In: International Conference on Software Language Engineering,
20. Chowdhury, S.A., Varghese, L.S., Mohian, S., Johnson, T.T., Csall- pp. 180–200. Springer (2011)
ner, C.: A curated corpus of simulink models for model-based 39. Kolovos, D., De La Vega, A., Cooper, J.: Efficient generation of
empirical studies. In: 2018 IEEE/ACM 4th International Work- graphical model views via lazy model-to-text transformation. In:
shop on Software Engineering for Smart Cyber-Physical Systems Proceedings of the 23rd ACM/IEEE International Conference on
(SEsCPS), pp. 45–48. IEEE (2018) Model Driven Engineering Languages and Systems, pp. 12–23
21. Clarisó, R., Cabot, J.: Applying graph kernels to model-driven (2020)
engineering problems. In: Proceedings of the 1st International 40. López, J.A.H., Cuadrado, J.S.: Mar: A structure-based search
Workshop on Machine Learning and Software Engineering in Sym- engine for models. In: Proceedings of the 23rd ACM/IEEE Inter-
biosis, pp. 1–5 (2018) national Conference on Model Driven Engineering Languages and
22. Deissenboeck, F., Hummel, B., Juergens, E., Pfaehler, M., Schaetz, Systems, pp. 57–67 (2020)
B.: Model clone detection in practice. In: Proceedings of the 4th 41. López-Fernández, J.J., Guerra, E., De Lara, J.: Assessing the qual-
International Workshop on Software Clones, pp. 57–64 (2010) ity of meta-models. In: MoDeVVa@ MoDELS, pp. 3–12. Citeseer
23. Deissenboeck, F., Hummel, B., Jürgens, E., Schätz, B., Wag- (2014)
ner, S., Girard, J.F., Teuchert, S.: Clone detection in automotive 42. Lucrédio, D., Fortes, R.P., Whittle, J.: Moogle: A model search
model-based development. In: 2008 ACM/IEEE 30th International engine. In: International Conference on Model Driven Engineering
Conference on Software Engineering, pp. 603–612. IEEE (2008) Languages and Systems, pp. 296–310. Springer (2008)
24. Di Rocco, J., Di Ruscio, D., Härtel, J., Iovino, L., Lämmel, R., 43. Lucrédio, D., Fortes, R.P., Whittle, J.: MOOGLE: a metamodel-
Pierantonio, A.: Understanding mde projects: megamodels to the based model search engine. Softw. Syst. Model. 11(2), 183–208
rescue for architecture recovery. Softw. Syst. Model. 19(2), 401– (2012)
423 (2020) 44. Nguyen, H.A., Nguyen, T.T., Pham, N.H., Al-Kofahi, J.M.,
25. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Mining Nguyen, T.N.: Accurate and efficient structural characteristic fea-
metrics for understanding metamodel characteristics. In: Proceed- ture extraction for clone detection. In: International Conference on
ings of the 6th International Workshop on Modeling in Software Fundamental Approaches to Software Engineering, pp. 440–455.
Engineering, pp. 55–60 (2014) Springer (2009)
26. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Collabo- 45. Nguyen, P.T., Di Rocco, J., Di Ruscio, D., Pierantonio, A., Iovino,
rative repositories in model-driven engineering [software technol- L.: Automated classification of metamodel repositories: a machine
ogy]. IEEE Softw. 32(3), 28–34 (2015) learning approach. In: 2019 ACM/IEEE 22nd International Con-
27. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Mining ference on Model Driven Engineering Languages and Systems
correlations of atl model transformation and metamodel metrics. (MODELS), pp. 272–282. IEEE (2019)
In: 2015 IEEE/ACM 7th International Workshop on Modeling in 46. Nguyen, P.T., Di Ruscio, D., Pierantonio, A., Di Rocco, J., Iovino,
Software Engineering, pp. 54–59. IEEE (2015) L.: Convolutional neural networks for enhanced classification
28. Dijkman, R., Dumas, M., García-Bañuelos, L.: Graph matching mechanisms of metamodels. J. Syst. Softw. 172, 110860 (2021)
algorithms for business process model similarity search. In: Inter- 47. Nikolentzos, G., Siglidis, G., Vazirgiannis, M.: Graph kernels: a
national Conference on Business Process Management, pp. 48–63. survey. arXiv preprint arXiv:1904.12218 (2019)
Springer (2009) 48. Pham, N.H., Nguyen, H.A., Nguyen, T.T., Al-Kofahi, J.M.,
29. France, R., Bieman, J., Cheng, B.H.: Repository for model Nguyen, T.N.: Complete and accurate clone detection in graph-
driven development (remodd). In: International Conference on based models. In: 2009 IEEE 31st International Conference on
Model Driven Engineering Languages and Systems, pp. 311–317. Software Engineering, pp. 276–286. IEEE (2009)
Springer (2006) 49. Porter, M.F.: An algorithm for suffix stripping. Program (1980)
30. George, L.: HBase: the definitive guide: random access to your 50. Reza, S.M., Badreddin, O., Rahad, K.: Modelmine: a tool to facili-
planet-size data. O’Reilly Media, Inc. (2011) tate mining models from open source repositories. In: Proceedings
123
of the 23rd ACM/IEEE International Conference on Model Driven José Antonio Hernández López
Engineering Languages and Systems: Companion Proceedings, pp. is a predoctoral researcher at the
1–5 (2020) University of Murcia. He obtained
51. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Frame- a [Link]. in Mathematics, a [Link]. in
work: BM25 and Beyond. Now Publishers Inc, London (2009) Computer Science and a [Link]. in
52. Sanchez, B.A., Barmpis, K., Neubauer, P., Paige, R.F., Kolovos, D.S.: Restmule: enabling resilient clients for remote APIs. In: Proceedings of the 15th International Conference on Mining Software Repositories, pp. 537–541 (2018)
53. Stephan, M.: Towards a cognizant virtual software modeling assistant using model clones. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 21–24. IEEE (2019)
54. Störrle, H.: Towards clone detection in UML domain models. Softw. Syst. Model. 12(2), 307–329 (2013)
55. Störrle, H.: Effective and efficient model clone detection. In: Software, Services, and Systems, pp. 440–457. Springer (2015)
56. Strüber, D., Acreţoaie, V., Plöger, J.: Model clone detection for rule-based model transformation languages. Softw. Syst. Model. 18(2), 995–1016 (2019)
57. Yan, Z., Dijkman, R., Grefen, P.: Fast business process similarity search. Distrib. Parallel Databases 30(2), 105–144 (2012)
58. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). [Link]
59. Zhai, C., Massung, S.: Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining (2016)

Big Data from the University of Murcia. Currently, he is enrolled in a Ph.D. programme in Computer Science supervised by Jesús Sánchez Cuadrado. His research interests include model-driven engineering (MDE), recommender systems and machine learning for software engineering.

Jesús Sánchez Cuadrado is a Ramón y Cajal researcher at the Languages and Systems Department of the University of Murcia. His research is focused on model-driven engineering (MDE) topics, notably model transformation languages, meta-modelling and domain-specific languages. Lately, he has been involved in the construction of the MAR search engine ([Link] On these topics, he has published several articles in journals and peer-reviewed conferences, and developed several open-source tools. His e-mail address is jesusc@[Link] and his web-page is [Link]

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.