
Software and Systems Modeling (2022) 21:1715–1737

[Link]

SPECIAL SECTION PAPER

An efficient and scalable search engine for models


José Antonio Hernández López1 · Jesús Sánchez Cuadrado1

Received: 22 February 2021 / Revised: 1 October 2021 / Accepted: 22 November 2021 / Published online: 27 December 2021
© The Author(s) 2021

Abstract
Search engines extract data from relevant sources and make them available to users via queries. A search engine typically
crawls the web to gather data, analyses and indexes it and provides some query mechanism to obtain ranked results. There
exist search engines for websites, images, code, etc., but the specific properties required to build a search engine for models
have not been explored much. In previous work, we presented MAR, a search engine for models which has been designed
to support a query-by-example mechanism with fast response times and improved precision over simple text search engines.
The goal of MAR is to assist developers in the task of finding relevant models. In this paper, we report new developments
of MAR which are aimed at making it a useful and stable resource for the community. We present the crawling and analysis
architecture with which we have processed about 600,000 models. The indexing process is now incremental and a new index
for keyword-based search has been added. We have also added a web user interface intended to facilitate writing queries and
exploring the results. Finally, we have evaluated the indexing times, the response time and search precision using different
configurations. MAR has currently indexed over 500,000 valid models of different kinds, including Ecore meta-models,
BPMN diagrams, UML models and Petri nets. MAR is available at [Link]

Keywords Model repositories · Search engines · Model-driven engineering

Communicated by S. Abrahão, E. Syriani, H. Sahraoui and J. de Lara.

José Antonio Hernández López
joseantonio.hernandez6@[Link]

Jesús Sánchez Cuadrado
jesusc@[Link]

1 Facultad de Informática, Universidad de Murcia, Murcia, Spain

1 Introduction

The availability of mechanisms for effective navigation and retrieval of software models is essential for the success of the MDE paradigm [16,26], since they have the potential to foster communities of modellers, improve learning by letting newcomers explore existing models and help discover high-quality models which can be reused. There are several model repositories available [26], some of which offer public models. For instance, the GenMyModel1 cloud modelling service features a public repository with thousands of available models, although it does not support any search mechanism. Source code repositories like GitHub and GitLab also host thousands of models of different kinds. However, these services do not provide model-specific features to facilitate the discovery of relevant models.

1 [Link]

In this setting, search engines are the proper tool for users to be able to locate the relevant models for their tasks. For example, Moogle [42] is a text-based search engine, but it does not include model crawlers and it is discontinued. In [15], a query-by-example search engine specific for WebML models is presented. However, it only scales to a few hundreds of models. Other search approaches are based on OCL-like queries [11,38], which can be efficient but only select exact matches. Another aspect is that existing approaches are not useful in practice since they do not provide access to a high number of diverse and public models.

Therefore, to date, there has been little success in creating a generic and efficient search engine specially tailored to the modelling domain and publicly available. Nowadays, when a developer faces the task of finding models, she needs to perform several steps manually. First, she needs to find out herself potential sources for models (e.g. public model repositories). Then, for each source, a dedicated query must be written (or a manual lookup must be done if no query mechanism is available), and candidate results must be collected.


The results are typically not ranked according to their relevance with respect to the query. What is more, when results come from different sources only manual comparison is possible. Altogether, searching for models is still an open problem in the MDE community.

In this paper, we present the evolution of MAR, a search engine specifically designed for models. MAR consists of several components, namely crawlers, model analysers, an indexer, a REST API and a web user interface. The crawlers and the analysers allow MAR to obtain models from known sources and prepare them to be consumed by the rest of the components. The indexer is in charge of organizing the models to perform fast and precise searches. At the user level, MAR provides two search modes: keyword-based and by-example. Keyword-based search provides a convenient way to quickly locate relevant models, but it does not take into account the structure of models. In the query-by-example mode, the user creates a model fragment using a regular modelling tool or a dedicated user interface (e.g. a textual editor) and the system compares such a fragment against the crawled models. This allows the search process to take into account the structure of the models. To perform this task efficiently, MAR encodes models by computing paths of maximum, configurable length n between objects and attribute values of the model. These paths are stored in an inverted index [7] (a map from items to the documents containing each item). Thus, given a query, the system extracts the paths of the query and accesses the inverted index to perform the scoring of the relevant models. We have evaluated the precision of MAR for finding relevant models by automatically deriving queries using mutation operators, and by systematically evaluating the configuration parameters in order to find a good trade-off between precision and the number of considered paths. Moreover, we evaluate its performance for both indexing and searching, obtaining results that show its practical applicability. Finally, MAR has currently indexed more than 500,000 models from several sources, including Ecore meta-models, UML models and BPMN models. MAR is freely accessible at [Link]

This work is based on [40], which has been extended as follows:

– An improved crawling and analysis architecture which allows us to grow the number of visited models up to 600,000.
– A new indexing process which is faster and is now incremental. An indexer for keyword-based search has also been included.
– The UI and the REST API have been greatly improved, which we hope makes them a better resource for the community.
– An extended evaluation, which also includes the indexing process and an analysis of the effect of the configuration parameters on the search precision and the response time.

Organization. Section 2 motivates the need for this work, presents an overview of our approach and the running example, and introduces the user interface of MAR. Section 3 presents the crawling and analysis architecture. Sections 4 and 5 explain keyword-based and example-based search, respectively. Section 6 discusses the indexing process. Section 7 reports the results of the evaluation. Finally, Sect. 8 discusses the related work and Sect. 9 concludes.

2 Motivation and overview

In this section, we motivate the need for model-specific search engines through a running example. From this, a set of challenges that should be tackled is derived. Then, we present an overview of our approach.

2.1 Motivation

As a running example, let us consider a scenario in which a developer is interested in creating a DSL which reflects the form of state machines, and thus will include concepts like state machine, state, transition, etc. The developer is interested in finding Ecore meta-models which can be useful either for direct reuse, for opportunistic reuse [33] (copy–paste reuse) or just for learning and inspiration purposes.

There are several places in which the developer could try to find meta-models. For instance, the ATL meta-model zoo contains about 300 meta-models,2 but it is only possible to look up models using the text search facility of the web browser. Using this search mechanism the user can find up to five meta-models which seem to represent state machines. This is shown in the upper image of Fig. 1. However, the only way to actually decide which one is closest to what the developer had in mind is to open the meta-models in Eclipse and inspect them.

It is also possible to look up meta-models in GitHub. In this case, a possible approach to find relevant models is by constructing a specific search query. For instance, to search for Ecore meta-models about state machines one could write a query like: EPackage state machine extension:ecore. This query will be matched against .ecore files which contain the keyword EPackage, as a hint to find Ecore/XMI formatted files, plus the state machine keywords. This is shown in the lower image of Fig. 1. Matched files are not validated (i.e. files may have a bad format), duplicates are likely to exist, and the user can only see the XMI representation of each file. In addition, the number of results can be relatively large, which exacerbates the problems mentioned before, since the user needs to manually process the models.

2 [Link] title=Ecore.


Fig. 1 Search approaches for the AtlanMod meta-model zoo (upper part) and GitHub (lower part)

Therefore, there are a number of shortcomings that developers face when trying to discover relevant models, namely:

– Heterogeneous model sources. Models may be available in several locations, which forces the user to remember all these sources and manually check each one in turn. In contrast, a search engine provides a single entry point to discover models regardless of their locations.
– Model-specific queries. The unavailability of query mechanisms that target models specifically means that the user needs to adapt the queries to the underlying platform and to give a special interpretation to the results.
– Ranking results. Results are typically not sorted using model-specific similarity criteria. In the best cases, keywords are used to drive the search. In some cases, there is no sorting at all (e.g. in the AtlanMod Zoo).
– Result inspection. The user would be interested in obtaining information about a model to make a decision about whether it is relevant or not. This information could be specific renderings of the model (or parts thereof), a textual description or simple tags describing the topics of the model.
– Model validation. When searching in source code repositories, there is no guarantee that the obtained models are valid, which means that the user needs to perform an additional step of validating the models before using them.

In this setting, the lack of a modelling platform able to address these issues has been the main driving force in the creation of MAR. Its design, which is described next, is intended to address these shortcomings.

2.2 Overview

The design of a search engine for models must consider several dimensions [43]. One dimension is whether the search is meta-model based, in the sense that only models conforming to the meta-model of interest are retrieved. This is useful when the user wants to focus on a specific type of models and avoid irrelevant results. However, sometimes the user might be interested in exploring all types of models as a first approximation to the search. Another dimension is whether the search engine is generic or specific to a modelling language. There exist search engines for specific modelling languages (e.g. UML [31] and WebML [15]), but the diversity of modelling approaches and the emergence of DSLs suggest that search engines should be generic.


Approaches which rely on exact matching are not adequate in this scenario since the user is typically interested in obtaining results which are approximately similar to the query. Thus, the nature of the search process requires an algorithm to perform inexact matching. Moreover, a ranking mechanism to sort the results according to their relevance is needed. The ranking needs to take into account the similarity of the query with respect to each result.

A search engine requires an indexing mechanism for processing and storing the available models in a manner which is adequate for efficient look-up. In addition, a good search engine should be able to handle large repositories of models while maintaining good performance as well as search precision. Another aspect is how to present the results to the user. This requires considering the integration with existing tools and building dedicated web services.

Our design addresses the shortcomings discussed in the previous section and the dimensions described above by including the following features, which are depicted in Fig. 2.

Fig. 2 Usage and components of the search engine

– Model-specific queries. The queries created by the user target the contents of the models indexed by the search engine. There are two query mechanisms available.
  – Keyword-based queries. In this search mode, the user interacts with the system by typing a few keywords and obtaining models whose contents match them (see Sect. 4). This is useful for quick searches but the results are less precise.
  – Query-by-example. In this search mode, the user interacts with the system by creating an example model (see Sect. 5). This example model contains elements that describe the expected results. The system receives the model to drive the search and to present a ranked list of results.
– Generic search. The engine is generic in the sense that it can index and search models when their meta-model is known. In the case of the query-by-example mode, the search is meta-model based because it takes into account the structure given by the meta-model.
– Crawling and indexing. An inverted index is populated with models crawled from existing software repositories (such as GitHub and GenMyModel) and datasets (such as [20]). We use Lucene as our index for keyword-based search (Sect. 4) and the HBase database as our backend for example-based queries (Sect. 6).
– Ranking procedure. The results are obtained by accessing the index to collect the relevant models among those available. Each model has a similarity score which is used to sort the results. The IDE or the web frontend is responsible for presenting the results in a user-friendly manner.
– Frontend. The functionality is exposed through a REST API which can be used directly or through some tool exposing its capabilities in a user-friendly manner. In particular, we have implemented a dedicated web interface (Sect. 2.3).
– Metadata and rendering. The system gathers metadata about the models and analyses them (e.g. how many elements are there in this model?) in order to provide the user with additional information to make decisions. Moreover, model-specific renderings are also provided to facilitate the inspection of the models.

The availability of a platform of this type would allow modellers to discover relevant models for their tasks. We foresee scenarios like the following:

– Model reuse. Users who want to find models useful for their projects may use the query-by-example facility to provide a description of the structure that relevant models need to have in order to obtain them.
– Novel modellers may be interested in navigating models of a certain type to learn from other, possibly more experienced, modellers.
– Researchers who want to easily gather models (with metadata attached) in order to perform experiments (e.g. machine learning, analytics, etc.).
– Large enterprises hosting several model repositories could leverage a search engine to make their stored models easily browsable (i.e. by having a local installation of MAR).

2.3 Using MAR in practice

This section provides a summary of the usage of MAR, intended to give the reader a concrete view of how the features outlined in the previous section are materialized in practice. The goal is to facilitate the understanding of the subsequent sections.

Following the running example about searching for models related to state machines, we consider two different, but complementary use cases.

– Keyword-based search. In this use case, the user wants to skim available models for state machines by quickly typing related keywords. For instance, in Fig. 3 (upper part) the user writes state machine as keywords.


This type of query is more likely to return irrelevant results (e.g. the keywords state machine alone may also return models related to factories which use machines). However, such queries are easy to craft and to refine. An additional advantage is that a keyword-based query can naturally be used to perform queries across model types because it is possible to build an efficient common index for all models. This can be useful to explore for which modelling formalisms there are interesting models available. For instance, a developer might be interested in models representing ATMs and she might discover useful UML state machines and activity diagrams, BPMN models and even Petri nets that model this domain. She may want to reuse, e.g., the UML models, but use the BPMN models and Petri nets as inspiration to complete and refine the UML models.
– Query-by-example. In this use case, the user already has in mind the model elements and wants the results to be relatively close to a proposed example. Thus, in our system the user would create a query by crafting an example model which represents the main characteristics of the results that the user expects. Fig. 3 (bottom part) shows a model fragment in which the user indicates that they are interested in state machines in which Initial and Final states are modelled as classes. It is worth noting that this imposes a requirement on the obtained results. This means that, for instance, models in which the initial and final states are modelled using an enumeration will have a low score.

Fig. 3 Keyword-based search and query-by-example

To illustrate the usage of MAR, Fig. 4 shows the web-based user interface as it is used to perform a query-by-example search.3 In this case, the user indicates that they are interested in crafting an example (1) and chooses the type of the model, Ecore in this case (2). Then, a model fragment or example that serves as the query must be provided. This can be accomplished by uploading an XMI file or by using a concrete syntax. For Ecore, we support the Emfatic syntax.

3 The interface for keyword-based search is similar. It is accessed by clicking on the Keywords button.

When the user clicks on the Submit! button, the search service is invoked, and it returns the list of results. Each result (5) contains the file name (which is a link to the original file), a description (if available), a set of topics (if available), statistics (e.g. the number of model elements) and quality indicators (e.g. quality smells [41]).

Many times the number of results can be large. MAR provides a facet-based search facility (4) with which the user can filter out models by several characteristics, namely the number of quality smells, elements or topics, or change the sort order.

Finally, to facilitate the inspection of models, a rendering is also available (6).

In addition to the user interface, a REST endpoint is also available. This can be useful to empower modelling tools. We foresee scenarios in which the search results are used to provide recommendations in modelling editors (similar to the approach proposed in [53]). For instance, when the user is editing a model, MAR could be called in the background and the results could be used to recommend new elements to be added to the model, model fragments that could be copy-pasted, etc. We also envision usages of the REST API to take advantage of the query facilities in order to implement approaches to recover the architecture of MDE projects [24]. For instance, given an artefact which is not explicitly typed (i.e. its meta-models are not known for some reason), its actual meta-model could be discovered by performing queries with inferred versions of its meta-model.

3 Crawling and analysis

This section describes how MAR extracts models from available resources, and how they are analysed. From a practical point of view, this is a key component of a search engine, since it is in charge of obtaining and feeding the data that will be indexed and searched. Moreover, it poses a number of technical challenges which are also discussed in this section.

Figure 5 shows the architecture that we have built. It has three main parts: crawlers, which collect models; analysers, which take these models and validate them, compute statistics and extract additional metadata from them; and indexers, which are in charge of organizing the models so that they can be searched in a fast manner. The goal of this architecture is to automate the crawling and analysis of models as much as possible and to have components that can be run and evolved independently.


Fig. 4 User interface of MAR (1: query-by-example, 2: model type, 3: model editor, 4: facet-based filtering, 5: result, 6: model rendering)

Fig. 5 Crawling and analysis architecture

3.1 Model sources

The first challenge is the identification of model sources. Models can be found in different locations, like source code repositories (e.g. GitHub, GitLab), dedicated model repositories (e.g. the AtlanMod Zoo) or private repositories (e.g. within an enterprise). There is no standard mechanism for discovering these sources, and therefore this step is manual. Moreover, many times there is an additional step of identification of relevant model types along with the required technology to process them. For instance, to process Archimate models we needed to gather its meta-model4 and integrate it appropriately in our system. We have included in MAR those sources and types of models that we were already aware of, or that we have learned about through our interaction with the MDE community. We plan to include new sources and new types of models as we discover them. An important trait is that our search engine embeds all the knowledge that we have gathered about model sources and how to process them, so that it provides an effective means for reusing this knowledge by simply searching for models.

4 [Link]

3.2 Crawling model sources

A search engine is more useful when it is able to provide access to a greater amount of resources. Crawling is the process of gathering models from sources such as model repositories, source code repositories and websites. Hence, our crawlers are a key element of MAR. An important challenge is that each repository or source of models has its own API or protocol for accessing its resources, or even no specific API at all (e.g. models listed in plain HTML pages).


This means that there is typically little reusability when implementing crawlers, and for each type of repository a crawling strategy is needed.

In our architecture, each crawler accesses the remote resource and downloads the models, which are stored temporarily on a local disk. In addition, the metadata extracted for each model is stored in a database. In our design, for each pair of (repository, meta-model) we have a different database. This means that each crawler knows how to extract a specific type of models (i.e. conforming to some meta-model) from a given repository.

We have implemented three crawlers: for GitHub, GenMyModel and the AtlanMod zoo. All crawlers maintain a database to store the metadata associated with each model. The data model is shown in Fig. 6.

Fig. 6 Data model of the crawlers (Model: id, name, url, license, creation, update, topics, authors; optional project reference to Project: id, name, url)

For the AtlanMod meta-model zoo, we have implemented an HTML crawler which extracts the links to the meta-models and the associated information (description, topics, etc.).

For GenMyModel, we use its public API5 to download the metadata of the public models it stores. The metadata contains references to the location of the XMI files. The steps are as follows:

– Download the complete metadata catalogue, which is available as a collection of JSON files which can be easily downloaded from the API endpoint.
– Download models per type using the metadata catalogue, which contains hyperlinks to the actual XMI files.
– Process the metadata files and insert the relevant metadata into our crawler data model.

For GitHub, we use PyGithub6 to interact with the GitHub Search API. The GitHub API imposes two main limits which require special treatment. First, there is a maximum of 15,000 API requests per hour (for OAuth users). Moreover, it is recommended to pause for a few seconds between searches. Secondly, it returns up to 1024 results per search. This means that we need to add special filtering options to make sure that each search call always returns fewer than 1024 models. Our current approach is to add a minimum and maximum size to the search query, so that in each call we only obtain models within this range. Then, we move this sliding window, adjusting the minimum and maximum sizes dynamically according to the number of returned results. In practice, these limitations imply that the GitHub crawler may take several days to perform its job.

5 [Link]
6 [Link]
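The sliding-window strategy can be sketched as follows. This is an illustrative sketch only: the actual crawler is built on PyGithub and feeds the per-(repository, meta-model) databases, whereas here the query string, window sizes, pauses and adjustment policy are assumptions of ours, not MAR's exact settings.

import time
import requests

API = "https://api.github.com/search/code"           # GitHub code-search endpoint
HEADERS = {"Authorization": "token <OAUTH_TOKEN>"}    # placeholder token

def search_ecore_files(min_size, max_size, page):
    """One code-search call restricted to .ecore files within a size range."""
    query = f"EPackage extension:ecore size:{min_size}..{max_size}"
    resp = requests.get(API, headers=HEADERS,
                        params={"q": query, "per_page": 100, "page": page})
    resp.raise_for_status()
    return resp.json()

def crawl(min_size=0, window=200, upper_limit=500_000):
    """Slide a file-size window so that every call stays under the 1024-result cap."""
    while min_size < upper_limit:
        max_size = min_size + window
        first = search_ecore_files(min_size, max_size, page=1)
        total = first["total_count"]
        if total >= 1024 and window > 1:
            window //= 2                    # too many hits: shrink the window and retry
            continue
        for page in range(1, total // 100 + 2):
            results = search_ecore_files(min_size, max_size, page)
            for item in results["items"]:
                yield item                  # download + metadata extraction happens here
            time.sleep(2)                   # pause between searches (rate limits)
        min_size = max_size + 1
        window *= 2                         # few hits: widen the window for the next call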


Table 1 summarizes the types, number and sources of the models that we have crawled so far. The highest number of models corresponds to Ecore, UML and BPMN models. This is so as they are widely used notations. They have been obtained from GitHub, GenMyModel and the AtlanMod Zoo. We could also obtain a relatively large number of models from Archimate and Petri nets (in PNML format) since they are also well-known types of models. We also downloaded Sculptor models, which is a textual DSL to generate applications. This shows that our system is not limited to XMI serialized files. Finally, we converted Simulink models using Massif [4] from a curated dataset [20].

Table 1  Summary of the types of models crawled by MAR as of September 30, 2022

Type       Source         Crawled   Duplicates   Failed   Indexed   Observations
Ecore      GitHub          67,322       46,199      341    20,782   Based on the standard Ecore meta-model
Ecore      GenMyModel       3,987            3       27     3,957
Ecore      AtlanMod           304            1        4       299
UML        GitHub          53,082        7,282    1,699    44,101   Based on the Eclipse UML implementation
UML        GenMyModel     352,216          143   23,836   328,237
BPMN       GenMyModel      21,285            0      200    21,085   Based on the EMF BPMN2 meta-model
Archimate  GitHub             496           77      106       313   Based on the Archi meta-model
PNML       GitHub           3,291        1,576    1,044       671   Based on the PNML framework
Sculptor   GitHub             188           88        0        88   An application generator based on Xtext
RDS        GenMyModel      91,411          108      515    90,788   Entity/relationship diagrams; meta-model semi-automatically inferred
Simulink   Dataset [20]       200            0        0       200   Converted to EMF using Massif [4]
Total      –              593,582       55,477   27,972   510,321   –

3.3 Model analysis

The following step in the process is model analysis. It refers to the task of processing the gathered models, checking their validity and computing additional information such as statistics and quality measures.

The task of checking the validity of a model is more involved than it might seem at first sight, notably if one is interested in a fault-tolerant process. The underlying problem is that, given a crawled model, blindly loading and validating it may crash the analyser (e.g. out of memory errors, bugs in EMF, invalid formats, etc.). Therefore, the implementation must take this into account and load the model in a separate process, so that it is possible to capture the situation in which the process dies. If this is the case, the main execution thread can catch the error and consider the model invalid.

The analysis of a model consists of four main steps, and the results of the analysis are stored in the data model shown in Fig. 7.

Fig. 7 Data model of the analyser (Model: id, origin, name, url, license, creation, update, topics; associated Stats such as numElems, EcoreStats with packages and classes, and Smell entries with name and occurrences)

– Compute the MD5 hash of the file contents and check if it is duplicated with respect to an already analysed model. If so, mark its status as duplicated.
– Check if the model can be loaded (e.g. using EMF) and check its validity (e.g. using EMF validators). If the model is valid, mark it as such so that it can be later used in the indexing process.
– Compute statistics about the model. The most basic measure is the number of model elements, but we also compute other specific metrics for some model types like Ecore, UML, etc.
– Perform a quality analysis. We currently analyse potential smells for Ecore models. For this, we have implemented the smells presented in [41].

Table 1 also summarizes the number of duplicated, failed and actually indexed files.

3.4 Supporting different types of models

In MAR, we aim at supporting different types of models crawled from different sources. This is a challenge since it is not possible to analyse and process all types of models using a single approach. For instance, the code to load and analyse PNML files is different from the one used to load and analyse Ecore models. Moreover, the computation of metrics, smells and visualizations is specific to each kind of model.

To partially overcome this issue, the design of MAR is modular and extensible, in the sense that support for new model types can be added just by creating new modules and adding some configuration options. Implementation-wise, the addition of a new model type to MAR involves the following steps:

– Identify which modelling framework is needed to load the models. For instance, most models serialized as XMI can be loaded with EMF if their Ecore meta-model is known. Sometimes, the models are serialized using a textual syntax, for instance using Xtext. In this case, the loading phase requires invoking the corresponding Xtext facilities.
– A new analyser must be created, which is in charge of loading the model and invoking the corresponding validator (if available) or implementing new validation code. Moreover, the analyser may include the logic to compute metrics and smells. The results are stored in the data model shown in Fig. 7.
– The analyser must be registered in a global registry which contains pairs of (model type, factory object). Given a certain model, its model type is looked up in the registry and the factory object instantiates the validator (see the sketch below).
– To support new types of visualizations, a similar approach is used. Given a model type, a visualizer must be registered. A visualizer takes a model, loads it and generates an image. We currently target PlantUML, but other approaches are possible, like integrating the facilities provided by Picto [39].

In general, adding a new model type is relatively straightforward if it is based on EMF or an EMF-based framework (like Xtext) because MAR already provides facilities to implement the above steps.
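The registration mechanism described above can be pictured as a map from model type to factory. A minimal sketch with names of our own (the actual registry is part of MAR's code base and works over EMF):

from typing import Callable, Dict

class Analyser:
    """Base interface: load a model, validate it and compute metrics/smells."""
    def analyse(self, path: str) -> dict:
        raise NotImplementedError

class EcoreAnalyser(Analyser):
    def analyse(self, path: str) -> dict:
        # load the model with the corresponding modelling framework,
        # run its validators and compute statistics and smells (elided)
        return {"status": "VALID", "numElements": 0, "smells": []}

# Global registry: pairs of (model type, factory object)
ANALYSERS: Dict[str, Callable[[], Analyser]] = {}

def register_analyser(model_type: str, factory: Callable[[], Analyser]) -> None:
    ANALYSERS[model_type] = factory

def analyse_file(model_type: str, path: str) -> dict:
    # look up the model type and let the factory instantiate the analyser
    return ANALYSERS[model_type]().analyse(path)

register_analyser("ecore", EcoreAnalyser)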


3.5 Indexers

The models that have been deemed valid by the corresponding analyser are fed into the indexers in order to store them in a form suitable for fast query resolution. We have two indexers: one for keyword-based queries and one for example-based queries. These search modes are explained in the two following sections.

4 Keyword-based queries

MAR provides a keyword-based search mode in which the user just types a set of keywords and the system returns models that contain these input keywords. To implement this functionality, we rely on existing techniques related to full-text search, which has been widely studied in the literature [59]. In our context, we assume that the string attributes of the models that we want to expose in the search engine act as keywords. Thus, we consider each model to be a document consisting of a list of keywords. The expectation is that a model contains names which reflect its intention and the user can reach relevant models by devising keywords similar to those appearing in the models.

Implementation-wise, we use Apache Lucene [2] to implement this functionality, since it provides facilities to pre-process documents, to index the documents and to perform the keyword-based search. To integrate the crawled models with Lucene, each model goes through the following pipeline:

1. Term extraction. We extract all the string attributes (e.g. [Link] in Ecore) from the model. For instance, the text document for the model shown in Fig. 3 would be:

   StateMachine states State name InitialState FinalState

2. Camel case tokenization. Each camel case term is divided into lowercase subterms. For example, the previous terms are transformed into:

   state machine states state name initial state final state

3. Word pre-processing. Full-text search engines, like Lucene, typically perform a pre-processing step which includes techniques like stop word removal (i.e. removing common tokens like to and the) and stemming to reduce inflectional forms to a common base form (e.g. transforming a plural word like states into its singular form state). Using the Porter Stemming Algorithm [49], we obtain:

   state machin state state name initi state final state

4. Storing of the bag of terms. The list of subterms is transformed into pairs (term, freq) where freq is the frequency of term in the list. This step is done internally by Lucene. For example:

   (state,5) (machin,1) (name,1) (initi,1) (final,1)

   Finally, this set of pairs is stored in the Lucene inverted index.

Moreover, in each document we also include metadata about the corresponding model obtained in the crawling process, such as the description and topics. Then, the documents are indexed in an inverted index managed by Lucene.

Regarding the search process, given a set of keywords as the query, Lucene accesses the inverted index and ranks the models according to their scores with respect to the query (it uses the Okapi BM25 [51,59] scoring function).

A feature of our keyword-based index is that all models are indexed together. This means that a query like state machine may return models of different types, like Ecore meta-models representing state machines, but also, e.g., UML models in which the keywords appear. The user can then use the UI filters to select the concrete type of models.

The main disadvantage of keyword-based queries is that the structure of the models is not taken into account. Thus, a query like state transition cannot distinguish whether the user is interested in models containing classes named State or Transition, or in references like states or transitions. Thus, one way to improve the precision of the search is to provide model fragments as queries so that the structure of the models can be taken into account.
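The pipeline above (term extraction, camel-case splitting, stop-word removal, stemming and counting) can be summarized in a short sketch. This is an illustration only: MAR delegates these steps to Lucene analyzers, whereas here the camel-case splitter is hand-rolled and the Porter stemmer is left as a placeholder.

import re
from collections import Counter

def camel_case_split(term: str) -> list[str]:
    """'InitialState' -> ['initial', 'state']."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", term)
    return [p.lower() for p in parts]

STOP_WORDS = {"to", "the", "a", "an", "of"}   # illustrative stop-word list

def stem(token: str) -> str:
    # placeholder: the real pipeline applies the Porter stemming algorithm,
    # e.g. 'states' -> 'state', 'machine' -> 'machin'
    return token

def model_to_bag_of_terms(string_attributes: list[str]) -> Counter:
    tokens = []
    for attr in string_attributes:
        for sub in camel_case_split(attr):
            if sub not in STOP_WORDS:
                tokens.append(stem(sub))
    return Counter(tokens)   # pairs (term, freq), as stored in the Lucene index

# Example: the model of Fig. 3
print(model_to_bag_of_terms(
    ["StateMachine", "states", "State", "name", "InitialState", "FinalState"]))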


5 Query-by-example

MAR provides support for another search mode called query-by-example. In this case, the user creates a small example model as the query and the system looks up models which have parts similar to this example. In this section, we explain how MAR performs approximate structural matching to determine relevant models. This is achieved in three steps:

1. Transforming each model into a multigraph.
2. Extracting paths from the graph.
3. Comparing the paths extracted from the query and the paths extracted from each model in the repository.

5.1 Graph construction

In our approach we transform models into a labelled multigraph with the following form:

G = (V, E, f, {μ1V, μ2V}, {μE})

where:

– V is the set of vertices, E is the set of edges and f : E → V × V is a function which defines the source and the target of each edge.
– μ1V : V → {attribute, class} is a vertex labelling function that indicates whether a vertex represents an attribute value or the class name of an object.
– μ2V : V → LV, where LV = LA ∪ LC. This is a vertex labelling function that maps vertices to a finite vocabulary which has two components:
  – LA corresponds to the set of attribute values. If v ∈ V is an attribute then μ2V(v) ∈ LA.
  – LC corresponds to the set of names of the different classes. If v ∈ V is a class then μ2V(v) ∈ LC.
– μE : E → LE, where LE = LR ∪ LAR. This is an edge labelling function which has two components:
  – LR corresponds to the set of names of the different references. An edge whose label belongs to this vocabulary connects two classes.
  – LAR corresponds to the set of names of the different attributes. An edge whose label belongs to this vocabulary connects a class vertex and an attribute vertex.

The procedure followed to transform a model into a multigraph is shown in Algorithm 1. This procedure receives a model as input, its meta-model and the set of all meta-classes (C), references (R) and attributes (A) that will be taken into account (i.e. those model elements whose meta-types are not in these sets are filtered out). The output is a directed multigraph whose label languages LR, LAR and LC are determined by R, A and C, respectively. In the output multigraph we can distinguish between two types of nodes: nodes labelled with class in μ1V (we call them object class nodes) and nodes labelled with attribute in μ1V (we call them object attribute nodes).

Data: input model, meta-model, C, R and A
Result: labelled multigraph
while not all model objects are visited do
    read an object o from the input model;
    if o does not have an associated class node and the meta-class of o belongs to C or inherits from a meta-class of C then
        add node v (associated with o) labelled with class and the name of the meta-class;
    forall the attributes of o do
        if the attribute belongs to A then
            add node w (associated with the attribute value);
            label w with attribute and the value of the attribute;
            add edge (v, w) labelled with the name of the attribute;
            add edge (w, v) labelled with the name of the attribute;
        end
    end
    forall the references of o to another object o' do
        if the reference belongs to R then
            if o' has not been visited then
                add node u (associated with object o') labelled with class and its meta-class;
            add edge (v, u) labelled with the reference name (u is the node associated with o');
        end
    end
    mark o as visited;
end
Algorithm 1: Procedure that transforms a model into a labelled multigraph.

For instance, the model in Fig. 3 is transformed into the graph shown in Fig. 8 using Algorithm 1. In this example, we consider as input for the algorithm:

R = {all references}
A = {[Link], [Link], [Link]}
C = {ENamedElement}

Fig. 8 Multigraph obtained from the model in Fig. 3. Object class nodes are depicted as rounded rectangles, while attribute nodes are depicted as ovals

This graph is the interface of MAR with the different modelling technologies, that is, MAR can process any model that can be transformed to this graph. Up to now, we have implemented the algorithm for EMF, but we foresee other implementations for other types of models.
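A minimal sketch of Algorithm 1 follows, written against a technology-agnostic object representation rather than EMF (which is what the actual implementation uses). The Obj class, the inheritance-free membership test for C and the unqualified attribute names are simplifications of ours.

from dataclasses import dataclass, field

@dataclass
class Obj:
    """Technology-agnostic stand-in for a model element (an EObject in EMF)."""
    meta_class: str
    attrs: dict = field(default_factory=dict)   # attribute name -> value
    refs: dict = field(default_factory=dict)    # reference name -> list of Obj

def to_multigraph(objects, C, R, A):
    """Build the labelled multigraph of Algorithm 1 (simplified: no inheritance)."""
    node_labels = {}    # node id -> ('class', meta-class name) or ('attribute', value)
    edges = []          # (source node id, edge label, target node id)
    counter = 0

    def new_node(label):
        nonlocal counter
        counter += 1
        node_labels[counter] = label
        return counter

    # one class node per object whose meta-class is taken into account
    class_node = {id(o): new_node(("class", o.meta_class))
                  for o in objects if o.meta_class in C}

    for o in objects:
        v = class_node.get(id(o))
        if v is None:
            continue
        for attr, value in o.attrs.items():
            if attr in A:                      # filtered attributes only
                w = new_node(("attribute", value))
                edges.append((v, attr, w))     # both directions, as in Algorithm 1
                edges.append((w, attr, v))
        for ref, targets in o.refs.items():
            if ref in R:                       # filtered references only
                for t in targets:
                    u = class_node.get(id(t))
                    if u is not None:
                        edges.append((v, ref, u))
    return node_labels, edges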
the multiset of paths that have been extracted from a multi-


5.2 Path extraction

Once the graph associated with a model has been created, we perform a path extraction process. The underlying idea is to use paths and subpaths found in the graph as the means to compare the example query against the models in the repository. In this context, the term bag of paths (BoP) refers to the multiset of paths that have been extracted from a multigraph. A path that belongs to a BoP can have one of the following forms:

– Path of null length: (μ2V(v)), or
– Path of length greater than zero: (μ2V(v1), μE(e1), μ2V(v2), ..., μE(en−1), μ2V(vn)), with n ≥ 2.

This definition of BoP establishes the form of the elements that should belong to this multiset and does not enforce which paths must be computed. Therefore, extraction criteria must be defined depending on the application domain. We follow the same criteria as our previous work [40], in which only paths of these types are considered:

– All paths of zero length for objects with no attributes. In Fig. 8, the only object without attributes is the EPackage object (because we did not establish the package name). Therefore, the only example of this type of path is (EPackage).
– All paths of unit length (paths of length one) between attribute values and their associated object class. In Fig. 8, examples of this type of path are (states −name→ EReference), (State −name→ EClass), (false −abstract→ EClass), etc.
– All simple paths with length less than or equal to a threshold (paths of length 3 or 4 are typically enough to encode the model structure, as determined empirically in Sect. 7) between attributes, and between attributes and objects without attributes. In Fig. 8, examples of this type of path are (states −name→ EReference −containment→ true), (StateMachine −name→ EClass −eStructuralFeatures→ EReference −name→ states), (EPackage −eClassifiers→ EClass −name→ InitialState), (StateMachine −name→ EClass −eStructuralFeatures→ EReference −eType→ EClass −name→ State), etc.

The underlying idea is that paths of zero and unit length encode the data of an object (its internal state). For instance, the path (State −name→ EClass) means that the model has an EClass whose name is State. On the other hand, paths of length greater than one encode the model structure. For instance, the path (StateMachine −name→ EClass −eStructuralFeatures→ EReference −name→ states) means that the query expects the existence of a class named StateMachine with a reference named states.

5.3 Scoring

Given a query q (a model) and a repository of models M, our goal is to compare each model of the repository with the query. To achieve this, all the models in M are transformed into a set of multigraphs and then paths are extracted, obtaining a set of bags of paths, BoPs = {BoP_1, ..., BoP_n}. The query q follows the same pipeline and is transformed into BoP_q. To perform the comparison between BoP_q and each bag BoP_i of BoPs, that is, to compute the score r(BoP_q, BoP_i), we use the adapted version of the Okapi BM25 scoring function [51,59] used in [40]:

r(BoP_q, BoP_i) = Σ_{ω ∈ BoP_q ∩ BoP_i} [ c_ω(q) · (z+1) · c_ω(i) / ( c_ω(i) + z · (1 − b + b · |BoP_i| / avdl) ) ] · log( (n+1) / df(ω) )    (1)

where 1 ≤ i ≤ n, c_ω(q) is the number of times that a path ω appears in the bag BoP_q, c_ω(i) is the number of times that a path ω appears in the bag BoP_i, df(ω) is the number of bags in BoPs which contain that path, avdl is the average number of paths over all the BoPs in the repository and z ∈ [0, +∞), b ∈ [0, 1] are hyperparameters. This function takes BoP_i ∈ BoPs and BoP_q, and returns a similarity score (the higher, the better). We have chosen this scoring function because it has three useful properties for our scenario:

– It takes into account the paths that are in both bags of paths (the summation over ω ∈ BoP_q ∩ BoP_i and the frequency factor of Eq. (1)).
– It takes into account the paths that are very common in the bags of the entire repository (the log((n+1)/df(ω)) factor of Eq. (1)).
– It penalizes models which have a large size (the normalization term 1 − b + b·|BoP_i|/avdl of Eq. (1)) to prevent the effect caused by such models having significantly more matches. This is controlled by the hyperparameter b. If b → 1, larger models have more penalization. We take b = 0.75 as default value.

It is important to remark that the hyperparameter z controls how quickly an increase in the path occurrence frequency results in path frequency saturation. The bigger z is, the slower the saturation. We take z = 0.1 as default value.

Therefore, to perform a query, for each element in BoPs its score with respect to BoP_q is computed. Then, the repository is sorted according to the computed scores and the top k models are retrieved (the k models with highest score). While this is a straightforward approach, it is not efficient. Instead, an index structure is needed in order to make the scoring step scalable and efficient. The next section presents the implementation of this structure.


6 Indexing models for query-by-example

In a typical text retrieval scenario, an indexer organizes the documents in an appropriate data structure for providing a fast response to the queries. In our case, which is a model retrieval scenario, the indexer organizes models instead of documents. As in every search engine, the indexing module is crucial when we look for scalability and the management of large repositories.

6.1 Inverted index

The main data structure used by the indexer is the inverted index [59]. In a text retrieval context, the inverted index consists of a large array in which each row is associated with a word in the global vocabulary and points to a list of documents that contain this word.

In our context, instead of words we have paths and instead of documents we have models. This is a challenge that we have to deal with since it causes two problems:

– Large amount of paths per query. In contrast to a text retrieval scenario in which a query is composed of a small number of keywords, in query-by-example a small model/query can be composed of on the order of hundreds of different paths.
– Huge inverted index. In a text retrieval scenario, the number of words in the vocabulary is on the order of hundreds of thousands, whereas in model retrieval we have on the order of millions of different paths.

These two issues translate in practice into a lot of accesses per query to a huge inverted index. Therefore, the design of an appropriate and more complex structure beyond the basic inverted index is needed. The main ingredients that we use to tackle this are:

– The inverted index is built over HBase [1]. This database has a distributed nature. Therefore, the table that will be associated with the inverted index is split into chunks and they may be distributed over the nodes of a cluster. This design decision makes the system scalable with respect to the size of the index.
– Special HBase schema. We have designed an HBase schema to accommodate an inverted index particularly oriented to resolve path-based queries. The underlying idea is to split the paths in two parts. Doing this, the number of accesses per query to the inverted index is reduced. This is explained next.

6.2 Inverted index over HBase

We use Apache HBase [1,30] to implement the inverted index. HBase is a sparse, distributed, persistent, multidimensional, sorted map database inspired by Google BigTable [19]. The HBase data model consists of tables, and each table has rows identified by row keys, each row has some columns (qualifiers) and each column belongs to a column family and has an associated value. A value can have one or multiple versions identified by a timestamp. Therefore, given a table, a particular value has 4 dimensions:

(Row, Column Family, Column Qualifier, Timestamp) → Value

Due to the design of HBase, the Column Family and Timestamp dimensions should not have a large number of distinct values (i.e. three or four distinct values). On the other hand, there is no restriction on the Row and Column Qualifier dimensions.

Regarding reading operations, HBase provides two operations: scan (iterate over all rows that satisfy a filter) and get (random access by key to a particular row). With respect to writing, the main operation is put, which inserts a new value given its dimensions (row, column family and column qualifier). This operation does not overwrite if multiple versions are allowed.

Designing an HBase schema for an inverted index can be relatively straightforward: each path is a row key and the models that contain this path are columns whose qualifier is the model identifier. However, this approach suffers from one of the problems that we have explained previously: lots of accesses (get) per query. Therefore, in order to avoid this problem, we use the schema presented in [40]. Each path is split into two parts: the prefix and the rest. The prefix is used as row key whereas the rest is used as column qualifier. The split point of the path depends on whether it starts with an attribute node (the prefix is the first sub-path of unit length) or an object class node (the prefix is the first node). The value associated with the column is a serialized map which associates the identifier of each model which has this path with a tuple composed of the number of times the path appears in the model and the total number of paths of the model. This tuple is used by the system to compute the scores in an efficient manner. Therefore, this schema has the following form:

(Prefix, Column Family, Rest, Timestamp) → Map⟨ModelID → (number of times, model size)⟩
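The split of a path into row key (prefix) and column qualifier (rest) can be sketched as follows. This is an illustration under our own assumptions: paths are plain tuples, the map is JSON-serialized (the real index uses a binary serialization), and table is any HBase client object exposing a put operation (for instance a happybase table).

import json

def split_path(path: tuple, starts_with_attribute: bool) -> tuple[str, str]:
    """Row key = prefix, column qualifier = rest, as in the schema above."""
    cut = 3 if starts_with_attribute else 1   # first unit-length sub-path vs. first node
    prefix, rest = path[:cut], path[cut:]
    return ",".join(prefix), ",".join(rest)

def store_path(table, path, starts_with_attribute, model_stats):
    """model_stats maps model id -> (occurrences of the path, total paths of the model)."""
    row_key, qualifier = split_path(path, starts_with_attribute)
    value = json.dumps(model_stats).encode()
    table.put(row_key.encode(), {b"p:" + qualifier.encode(): value})

# Example path (state -name-> EClass -abstract-> false):
#   prefix = (state, name, EClass), rest = (abstract, false)
# store_path(table, ("state", "name", "EClass", "abstract", "false"), True,
#            {"id5": (1, 280), "id6": (1, 891), "id23": (1, 678)})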


Thus, now a path is the concatenation of the prefix (row key) and the rest (column qualifier). The column family is constant for all the paths in a table, and the number of different timestamps is limited by the number of versions allowed in the column family.

This schema is illustrated in Fig. 9. As an example we use three paths:

1. (state −name→ EClass)
2. (state −name→ EClass −abstract→ false)
3. (state −name→ EClass −eStructuralFeatures→ EAttribute −name→ name)

Fig. 9 HBase schema of the inverted index

For instance, the second path (state −name→ EClass −abstract→ false) is split into the prefix (state, name, EClass), used as row key, and the rest (abstract, false), used as column qualifier. The first and third paths have the same common prefix and thus share the same row key. The column qualifiers associated with the row key are the rest parts, that is, the empty rest (), (abstract, false) and (eStructuralFeatures, EAttribute, name, name). Each column qualifier has an associated value, which is a map whose keys are the model identifiers (idx, where x is a number) and whose values are pairs [a, b], where a is the number of times the path appears in model idx and b is the total number of paths of model idx. Moreover, in this example, the paths have two versions associated.

Altogether, this approach reduces the number of accesses to the index because in a given model there are many paths with the same prefix, and thus with one access it is possible to retrieve all of them.

6.3 Indexing process

Given a large repository of models of the same type (i.e. conforming to the same meta-model), indexing is the process of building the HBase table that represents the inverted index and follows the schema previously explained. Our search engine has two indexing modes:

– Batch: once the index is constructed, adding more models to the index is not allowed. Instead, if new models are discovered, the whole index needs to be deleted and all the models are indexed together. To create an index in batch mode, we process each model in the repository in turn, transform each model into a graph, compute paths, group paths by prefix and invoke HBase put operations to fill the index table.
– Incremental: adding more models to the index is allowed. We focus on this mode in the rest of the section.

Fig. 10 Possible sequence of insertion and merge operations in the incremental mode. In this case, the number of distinct versions that the column family allows is three

In order to implement the incremental mode, the indexing process is composed of two subprocesses:

– Insertion: it adds new models to the inverted index. In particular, this procedure receives as input a set of models, obtains their bags of paths and stores them in an HBase table. A detailed explanation is given in Sect. 6.3.1.
– Merge: after a sequence of insertion operations has been performed, the cells in the HBase table associated with the inverted index may have a set of versions. The merge procedure is in charge of merging all these versions of a cell into a single one. A detailed explanation is given in Sect. 6.3.2.


2
1 3 4 5
2.1 2.2
...

...

... ...

... ... ... ... ... ... ...

Repository ... ...


of models

Fig. 11 Insertion subprocess

1 2 3
4 5

... ...
... ...

Fig. 12 Merge subprocess

using Algorithm 1 (label 2.1 in Fig. 11). After that, the path extraction process is performed, obtaining the set of bags of paths {BoP1, ..., BoPn} (label 2.2 in Fig. 11).
3. Emit key–value pairs. The key is the processed path and the value is a tuple (model id, statistics) (label 3). For each BoP and for each distinct path that belongs to this BoP, we emit a key–value pair of the form (p, (id, (a, b))), where p is the path, id is the identifier of the model that the BoP is extracted from, a is the number of times the path appears in the BoP and b is the total number of paths of the BoP. Here, p is the key and (id, (a, b)) is the value.
4. Group by key and merge the list of values (from the key–value pairs) into a serialized map (label 4). We group the pairs emitted in the previous step according to the path (key), and the list of values associated with this key is transformed into a map of the form {id1: (a1, b1), id2: (a2, b2), ...}. For instance, as we can observe in Fig. 11, the map associated with the path p2 has the elements 1: (a_{p2,1}, b1) and 3: (a_{p2,3}, b3), since BoP1 and BoP3 contain the path p2.
5. Split the paths and store them in the HBase table (label 5). Each path is split into a prefix and a rest according to the HBase schema presented in Sect. 6.2, and it is stored in the inverted index table with its associated map using the HBase put operation.

These steps are similar in both the batch and the incremental modes. The only difference is that, in incremental mode, a timestamp is attached to each inserted value to allow multiple versions. This is possible because HBase allows multiple versions in column families. Therefore, when a put operation is executed and the key already exists, the last value is not removed; instead, a new version is added.

6.3.2 Merge

HBase is able to resolve read operations in the presence of versions. However, the number of versions of a column family is limited by an upper bound in order to control the size of the HFile (the distributed file that stores the column family). Therefore, it is necessary to perform a merge operation before the number of versions grows too large (in our implementation we set the maximum number of versions to four).
According to our HBase schema, the value associated with a version in a column is a serialized map of the models that contain a path (the path is given by the concatenation of the row key and the column qualifier). Therefore, these versions can be merged into a single one (i.e. these maps can be merged into a single map). The merge subprocess is a Spark [58] script which does the following (Fig. 12):

1. Read the table associated with the inverted index and extract all tuples (prefix, rest, timestamp, map) (label 1 in Fig. 12).
2. Ignore the timestamp dimension and emit key–value pairs (label 2 in Fig. 12). Here, the key is (prefix, rest) and the value is the serialized map.
3. Group by the key, that is, the coordinates (prefix, rest). This corresponds to step 3 in Fig. 12, in which (prefix_i, rest_i) are the distinct keys and each [map] is the list of maps associated with a key.
4. Merge the list of maps associated with each key (prefix_i, rest_i) and generate a new map. This step corresponds to label 4 in Fig. 12. Here, (prefix_i, rest_i) are the distinct keys and map_i is the merged map.
5. Store the result in a new HBase table (put operation) and delete the old one. This corresponds to label 5 in Fig. 12.
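To make these two subprocesses more concrete, the following is a minimal PySpark sketch of the insertion and merge stages. It is only an illustration of the idea, not the actual MAR scripts: models are assumed to be already available as bags of paths, split_path is a simplified stand-in for the prefix/rest split of Sect. 6.2, and the final write to HBase (the put operations and the version/timestamp handling) is omitted.

```python
from pyspark import SparkContext

def split_path(path):
    # Simplified stand-in for the prefix/rest split of Sect. 6.2: the first three
    # elements act as the row key and the remainder as the column qualifier.
    return tuple(path[:3]), tuple(path[3:])

def insertion(sc, models):
    """models: list of (model_id, bag_of_paths), each path being a tuple of labels."""
    rdd = sc.parallelize(models)                      # step 1: read and distribute the models

    def emit_pairs(entry):                            # steps 2-3: emit (path, (id, (a, b)))
        model_id, bag = entry
        total = len(bag)                              # b: total number of paths of the BoP
        counts = {}
        for p in bag:
            counts[p] = counts.get(p, 0) + 1          # a: occurrences of the path in the BoP
        return [(p, (model_id, (a, total))) for p, a in counts.items()]

    pairs = rdd.flatMap(emit_pairs)

    # step 4: group by path and build the map {id: (a, b), ...}
    grouped = pairs.groupByKey().mapValues(dict)

    # step 5: split each path into (prefix, rest); a real implementation would
    # issue HBase put operations here (with a timestamp in incremental mode).
    return grouped.map(lambda kv: (split_path(kv[0]), kv[1]))

def merge(cells):
    """cells: RDD of ((prefix, rest), map), possibly repeated once per version."""
    def merge_maps(m1, m2):
        merged = dict(m1)                             # versions hold disjoint model ids,
        merged.update(m2)                             # so merging is a plain dict union
        return merged
    return cells.reduceByKey(merge_maps)              # group by (prefix, rest) and merge

if __name__ == "__main__":
    sc = SparkContext(appName="mar-index-sketch")
    toy_models = [("id1", [("state", "name", "EClass"),
                           ("state", "name", "EClass", "abstract", "false")])]
    print(merge(insertion(sc, toy_models)).collect())
    sc.stop()
```

The grouping step is where the inverted index structure emerges: all the statistics of a path end up in a single map, which is then stored under its (prefix, rest) coordinates.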


Fig. 13 HBase table obtained after the application of the merge operation to the table represented in Fig. 9

For example, if the merge operation is applied to the HBase table represented in Fig. 9, the table represented by Fig. 13 is obtained. Looking at these two figures, we can see that maps associated with the same path but with different timestamps/versions have been merged into a single map.
To summarize, the batch mode is adequate when the managed repository is immutable (or very unlikely to change), or when it is small and re-indexing everything after a change is fast. Instead, the incremental mode is interesting when the repository may change (as is the case of GitHub or GenMyModel) or when it is very large and it is better to index models in chunks to avoid exhausting the available heap memory. We evaluate the indexing time of both modes in the following section.

7 Evaluation

In this section, we evaluate MAR from three perspectives. First, we evaluate the indexing procedure in terms of its performance, comparing the batch and incremental modes. Second, we evaluate the search precision and, finally, the query response time. Moreover, the source code of MAR is available at [Link] so that others can install it locally and perform experiments with it.

7.1 Indexing performance

As the number of crawled models grows, the performance of the indexer module becomes increasingly important. We have measured and compared the time that the incremental and batch modes take. The experiments have been run on an AMD Ryzen 7 3700X (4.4 GHz, 8 cores, 16 threads) with 32 GB of RAM. The Spark process is run locally with a driver memory of 8 GB.
To evaluate the effect of the repository size on the indexing time, we use the non-incremental mode and split a repository of 17,000 Ecore models into 17 "incremental" batches, that is, a batch with the first 1000 models, a second batch with the first 2000 models, and so on, up to a last batch with the complete 17,000 models. We first shuffled the models to have a uniform distribution of the different sizes. We index each batch, cleaning the database index between runs, and record the indexing time. Figure 14 shows the results. As the number of models increases, the indexing time grows linearly. The time taken to index the full repository is about 12 minutes.
To evaluate the incremental mode, we use the same repository of 17,000 Ecore models, but this time we create batches of 1000 models each. The experiment consists of indexing each batch in turn and, after every 4 insertions (5 when the index is empty), invoking the merge operation. We record the time of both the insertion and merge operations. Figure 15 shows the results. The insertion times are in a range of about 40–75 s per batch (the variations are due to some models being much larger than others). The merge operation is more costly and grows linearly with the size of the index.
Comparing the total indexing time, the batch mode takes 724 s (∼12 min), whereas the incremental mode takes 1410 s (∼23 min) when adding up all the insertion and merge times. Thus, there is a penalty when using the incremental mode due to the merge operation. However, given that the merge is only done from time to time, it pays off to use the incremental mode for large collections of models that require updates.

7.2 Search precision

We are interested in analysing the ability of MAR to match and rank relevant models in the first positions, as well as the effect of the different parameters on this task (i.e. the path length and the kind of meta-model elements considered in the graph construction). The evaluation of a search engine is known to be difficult since the notion of result relevance is subjective [59]. To address this issue, we follow the same idea proposed in [40], that is, we simulate a user who has in mind a concrete model by taking a model from the repository and deriving its associated query using mutation operators.
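As a rough illustration of this known-item evaluation protocol (described in more detail below), the following Python sketch computes the mean reciprocal rank (MRR) reported later in Table 4. The run_query callable and the list of (mutant, original model) pairs are placeholders for illustration; they are not part of MAR's API.

```python
import statistics

def reciprocal_rank(ranked_ids, expected_id):
    # Known-item search: there is exactly one relevant model per query.
    for position, model_id in enumerate(ranked_ids, start=1):
        if model_id == expected_id:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(query_pairs, run_query):
    """query_pairs: list of (query_mutant, original_model_id).
    run_query: placeholder for a call to the engine, returning a ranked list of ids."""
    ranks = [reciprocal_rank(run_query(query), expected)
             for query, expected in query_pairs]
    return statistics.mean(ranks)

# Hypothetical usage, comparing MAR against the text-search baseline on the same mutants:
#   mrr_mar  = mean_reciprocal_rank(mutants, mar_search)
#   mrr_text = mean_reciprocal_rank(mutants, baseline_text_search)
```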


Fig. 14 Non-incremental mode performance. The x-axis indicates the number of models that the process has to index and the y-axis the time in seconds. MAR scales smoothly as the size of the index increases

Fig. 15 Incremental mode performance. The x-axis indicates the stages and the y-axis the time in seconds. In the case of merge operations, MAR scales smoothly as the size of the index increases. In the case of insertions, the time used is constant since it only depends on the number of models considered and their size

Table 2 Mutation operators used to simulate queries

1. Extract connected subset: select a root element and pick classes reachable via references or subtype/supertype relationships, up to a given length.
2. Remove inheritance: remove a random inheritance link (up to 20%).
3. Remove leaf classes: remove classes, but prioritize those which are farther from the root element (up to 30%).
4. Remove references: remove a random reference (up to 30%).
5. Remove enumeration: remove random enumerations or literals (50%).
6. Remove attributes: remove a random attribute (up to 30%).
7. Remove low-df classes: remove elements whose name is "rare".
8. Rename from cluster: replace a name by another name corresponding to an element belonging to the same meta-model cluster (up to 30% of names).
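To give a flavour of what these operators do, here is a toy Python sketch of two of them (operator #4, remove references, and operator #8, rename from cluster) over a trivial in-memory meta-model representation. The data structure and function names are illustrative only; the actual operators are the ones reused from [40] and operate on Ecore models.

```python
import random

def remove_references(metamodel, fraction=0.3, rng=random):
    """Operator #4: drop up to `fraction` of the references.
    `metamodel` is a toy dict: {"classes": {...}, "references": [(src, name, tgt), ...]}."""
    refs = metamodel["references"]
    to_remove = set(rng.sample(range(len(refs)), int(len(refs) * fraction)))
    metamodel["references"] = [r for i, r in enumerate(refs) if i not in to_remove]
    return metamodel

def rename_from_cluster(metamodel, cluster_names, fraction=0.3, rng=random):
    """Operator #8: replace up to `fraction` of the class names with names taken
    from other meta-models of the same cluster."""
    names = list(metamodel["classes"])
    for old_name in rng.sample(names, int(len(names) * fraction)):
        new_name = rng.choice(cluster_names)
        metamodel["classes"][new_name] = metamodel["classes"].pop(old_name)
    return metamodel
```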
This query is a shrunk version of the original model which includes some structural and naming changes, and which is called a query mutant. The underlying idea is that, for each generated query mutant, we already know that the mutated model is precisely the model that we expect the search to return in the first position. This type of search, in which there is only one relevant model for each query, is called known-item search [59].
We use the repository of 17,690 Ecore meta-models used in [40] and we reuse the mutation operators already implemented for this previous work. We describe the operators here for the sake of completeness. For each meta-model, we apply the mutation operators summarized in Table 2 in turn. First, we identify a potential root class for the meta-model and we keep all classes within a given "radius" (i.e. the number of reference or supertype/subtype relationships which need to be traversed to reach a class from the root), and the rest are removed. All EPackage elements are renamed to avoid any naming bias. From this, we apply mutants to remove some elements (mutations 2–6). Next, mutation #7 is intended to make the query more general by removing elements whose name is very specific to this meta-model and almost never used in other meta-models (in text retrieval terms, the element name has a low document frequency). Finally, to implement mutation #8 we have applied clustering (k-means) based on the names of the elements of the meta-models, in order to group meta-models that belong to the same domain. In this way, this mutant attempts to apply meaningful renamings by picking up names from other meta-models within the same cluster. We apply this process with different radius configurations (5, 6 and 7) and we discard mutants with fewer than 3 classes or with fewer references than |classes|/2. Using this strategy we generate 1595 queries.
As a baseline to compare with, we use a pure text search engine. The full repository is translated into text documents which contain the names of each meta-model (i.e. the property [Link]). These documents are indexed and we associate the original meta-model with each document. Similarly, we generate the text counterparts of the mutant queries. On the other hand, we run MAR with 6 different configurations. In particular, two dimensions will be used: the maximum path length and the attributes considered (that is, the set A in Algorithm 1). We will consider the path lengths 2, 3 and 4 and these two versions of A:

– Aall = {[Link], [Link], [Link], [Link], ETypedElement.lowerBound}
– Anames = {[Link]}

The sets R and C are fixed to {all references} and {ENamedElement}, respectively. Therefore, a total of six configurations will be taken into account in the evaluation (Table 3). These six settings are chosen in order to study the

Table 3 The six configurations considered

            Path length 2   Path length 3   Path length 4
  Aall      All-2           All-3           All-4
  Anames    Names-2         Names-3         Names-4

Table 4 Precision evaluation results

                MRR     Differences in MRR (p value / difference)
  Text search   0.668   –
  Names-2       0.734   < .001 / +0.066
  Names-3       0.742   < .001 / +0.074
  Names-4       0.757   < .001 / +0.089
  All-2         0.742   < .001 / +0.074
  All-3         0.752   < .001 / +0.084
  All-4         0.702   < .001 / +0.034

effect of the path length (model structure considered) and the set A (information of the model considered) on the precision.
The evaluation procedure is as follows. Each query mutant has associated the original meta-model it comes from. Therefore, for each query, we perform the search and retrieve the ranked list of results. We look up the original meta-model in the ranked list and annotate its position r. Then, the reciprocal rank (1/r) is computed. Ideally, the best precision is achieved when the original meta-model is in the first position. Once all queries have been resolved, we summarize the set of reciprocal ranks using their average, obtaining the mean reciprocal rank (MRR). To see if the differences between MAR and the baseline are statistically significant we use the t test. The results are shown in Table 4. The first column contains the MRR of each one of the configurations of MAR and of the text search engine (baseline). The second one contains the differences in the mean and the p value when comparing with the baseline (for example, the cell < .001 / +0.066 associated with Names-2 means that the difference in MRR when comparing with Text search is 0.734 − 0.668 = +0.066 and it is statistically significant since the p value is < .001).
When only names are considered (Names-x where x = 2, 3, 4), the greater the length the better the precision. This is caused by the fact that increasing the path length increases the amount of model structure considered. In particular, in Names-2 the majority of paths are of the form (n) –name→ (named element) (that is, the bags of paths consider information of the form "this model has a named element whose name is n"). Names-3 considers all the information that Names-2 takes and adds interesting paths that encode facts of the type "this EClass of name n1 has this reference of name n2" (e.g. (StateMachine) –name→ (EClass) –eStructuralFeatures→ (EReference) –name→ (states) in Fig. 8) or "this EClass of name n1 has this attribute of name n2" ((State) –name→ (EClass) –eStructuralFeatures→ (EAttribute) –name→ (name) in Fig. 8). Finally, Names-4 adds paths of the type "this EClass of name n1 has a reference connected with an EClass of name n2" ((StateMachine) –name→ (EClass) –eStructuralFeatures→ (EReference) –eType→ (EClass) –name→ (state) in Fig. 8).
On the other hand, when more attributes are considered (All-x where x = 2, 3, 4), increasing the path length does not always improve the precision, since the configuration with length 4 is the one that achieves the worst results. In this case, an increase in the path length provokes an increase in the number of noisy and useless paths. In Fig. 8, an example of this type of noisy path is (StateMachine) –name→ (EClass) –ePackage→ (EPackage) –eClassifier→ (EClass) –abstract→ (false). This path encodes the fact that the user is interested in models with an EClass whose name is StateMachine and which belongs to an EPackage that contains a non-abstract EClass. The information that this path encodes does not help in the search and produces noise which may confuse the search engine.
From this evaluation we can conclude that using some model structure in the search is beneficial in terms of precision. As we have shown, MAR outperforms the text search in all configurations (all differences are positive and statistically significant with p value < 0.001). In order to choose a proper configuration we need to find a trade-off between precision and efficiency, since increasing the size of A and the maximum length causes an increase in the size of the index and, consequently, in the time needed to resolve a particular query. The effect on the response time is studied in the next section.
The main threat to validity in this experiment is that the derived mutants might not represent the queries that an actual user would create. We have manually checked that they are reasonably adequate, but an experiment with queries from real users is required to confirm the results. Regarding the selection of the configuration parameters, they are only valid for Ecore models. For other types of models, a similar analysis is required in order to select a proper configuration. Nevertheless, for models in which names play an important role (e.g. UML, BPMN) we expect configurations which rely on name attributes to perform similarly well.

7.3 Query response time

To evaluate the performance of MAR we have used all the mutant queries used in the search precision evaluation. We want to measure the query response time using each one of the six configurations introduced in the previous section. Thus, for each configuration, we have indexed the 17,690 Ecore models and, for each query mutant, we perform the search and measure the response time. Moreover, to evaluate the effect of the query size we count the number of packages, classes, structural features and enumerations, and we classify the queries into three types: small (fewer than 20 elements), medium (between 20 and 70) and large (more than 70). Table 5 shows some details about the contents of the queries.
Table 5 Statistics of the mutants set used in the evaluation

           EPackage               EClass                  EAttribute             EReference             Total
           Min Median Mean Max    Min Median Mean  Max    Min Median Mean Max    Min Median Mean Max    queries
  Small    1   1      1.2  8      3   5      5.3   11     0   2      2.4  11     2   4      3.8  11     337
  Medium   1   1      1.6  18     3   13     13.4  35     0   8      9.8  45     2   10     10.8 29     848
  Large    1   1      3.6  46     3   27     37.25 193    1   29     35.7 168    2   24     30.9 116    410
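For reference, the size classification used in Table 5 and in the discussion below (thresholds of 20 and 70 total elements) can be sketched as follows; the function and its signature are illustrative only.

```python
def classify_query(n_packages, n_classes, n_features, n_enums):
    """Small: fewer than 20 elements; medium: between 20 and 70; large: more than 70."""
    total = n_packages + n_classes + n_features + n_enums
    if total < 20:
        return "small"
    return "medium" if total <= 70 else "large"
```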

Table 6 Mean and max statistics of the query response times (in seconds)

             Small           Medium          Large
             Mean   Max      Mean   Max      Mean   Max
  All-2      0.13   0.25     0.24   0.51     0.35   1.82
  All-3      0.29   1.02     0.66   2.07     0.97   3.63
  All-4      0.98   2.75     2.36   6.60     3.47   9.18
  Names-2    0.03   0.09     0.06   0.16     0.09   1.08
  Names-3    0.05   0.21     0.11   0.41     0.19   1.64
  Names-4    0.08   0.41     0.28   1.41     0.66   3.85

Table 7 Mean and max statistics of the number of paths

             Small            Medium            Large
             Mean    Max      Mean     Max      Mean      Max
  All-2      66.0    195      178.7    892      437.1     1578
  All-3      238.2   916      789.1    4337     2001.0    8600
  All-4      619.7   4997     2849.0   21282    10340.0   48803
  Names-2    22.6    94       65.5     284      172.0     552
  Names-3    91.6    544      366.1    2030     1059.4    5069
  Names-4    284.5   3687     1679.0   13108    7724.0    37474

Fig. 16 Boxplot of the number of paths per query. The x-axis represents the configurations considered, the y-axis the number of paths of a query

Fig. 17 Boxplot of the query response time. The x-axis represents the configurations considered, the y-axis the time in seconds and the colour the size of the queries

Figure 17 shows, for each configuration and each query size, the query response time as a set of boxplots. Table 6 shows the mean and max statistics of the query response times per configuration and query size.
Increasing the path length causes an increase in the number of distinct paths. On the other hand, adding elements to the set A implies generating bigger multigraphs (since more nodes are considered). Consequently, increasing the set A also causes an increase in the number of distinct paths. This fact has an impact on the size of the index, the number of accesses to the inverted index and the path extraction process. Figure 16 shows the distribution of paths per query size and MAR configuration, and Table 7 contains the mean and max statistics of the number of paths of the queries per configuration and query type. It can be observed that a change in the MAR configuration of the form conf-x → conf-(x+1) (where conf is All or Names), or of the form Names-x → All-x (with x = 2, 3 or 4), causes an increase in the number of paths for each one of the three sizes.
Therefore, taking into account the results in Tables 4 and 6, a reasonable choice that considers part of the model structure, achieves good precision and still has a fast response time is Names-4.
Finally, it is worth mentioning that the small category corresponds to the queries that users would normally perform. In this case, the average response time is less than a second in all configurations (Table 6). On the other hand, for medium and large mutants the average query response time is near 2 and 3 s, respectively, in the most demanding configuration (All-4), which can be considered a good result since there
are queries with up to 193 classes. This suggests that MAR could be applicable in scenarios like model clone detection.

8 Related work

In this section, we review works related to our proposal. We organize these works into five categories: model encoding techniques, model repositories, mining and analysis of model repositories, search engines and model clone detection.

8.1 Model encoding techniques

The concept of BoP, which is a key element of our query-by-example approach, is inspired by the notion of bag of path-contexts defined in Code2vec [6]. In this paper, the authors study the transformation of code snippets into continuous distributed vectors by using a neural network with an attention mechanism whose input is a multiset of paths extracted from the AST of a code snippet.
SAMOS is a platform for model analytics. It is used in [8] for clustering and in [10] for meta-model clone detection. SAMOS's approach is very close to ours since both transform models into graphs and extract some paths (the authors call them n-grams). However, there are some differences between MAR and SAMOS. First, our graph is different as we consider single attributes as independent nodes, and therefore the extracted paths are different. And second, SAMOS is not designed to perform a search, since it uses similarity metrics which are not intended for fast search.
AURORA [45] extracts tokens from Ecore models and uses a neural network to classify them into domains (e.g. state machines, database, bibliography, etc.). These tokens are treated as words and the input of the neural network is a TF-IDF vector. In AURORA, paths are encoded by concatenating tokens. MEMOCNN [46] can be considered the evolution of AURORA. It uses an encoding schema similar to AURORA, but MEMOCNN transforms meta-models into images which are used to feed a convolutional neural network that performs the classification.
Clarisó and Cabot propose the use of graph kernels for clustering models [21]. The general idea is to transform models into graphs and apply machine learning methods using graph kernels [47] as a similarity measure between graph models.

8.2 Model repositories

The study performed by Di Rocco [26] shows that the search facilities provided by existing model repositories are typically keyword based, tag based, or there is no search facility at all. Our aim with MAR is to offer a search platform able to provide a search layer on top of third-party repositories from which models can be extracted.
MDEForge is a collaborative modelling platform [13] intended to foster "modelling as a service" by providing facilities like model storage, search, clustering, workspaces, etc. It is based on a megamodel to keep track of the relationships between the stored modelling artefacts. One goal of MDEForge is to remove the need of installing desktop-based tooling and to enable collaboration between users. As such, the services offered by MDEForge are focused on its registered users and their resources. In contrast, MAR is intended to offer a simpler and more open search service.
Img2UML [35] is a repository of UML models obtained by collecting images of UML diagrams through Google image searches. These images are converted to XMI files, and there is a search system based on querying names of specific properties (e.g. class names). The use of this repository in a teaching environment has been studied in [36], showing its usefulness for improving students' designs. MAR could be used in a similar setting, using its example-based facilities to improve the search process.
GenMyModel is a cloud service which provides online editors to create different types of models and stores thousands of public models [3]. Our crawlers target GenMyModel, using the REST API it offers, in order to provide search capabilities for its models.
ReMoDD is a repository of MDE artefacts of different nature, which is available through a web-based interface [29]. It is intended to host case studies, rather than individual files. In addition, it does not include specific search facilities.
Hawk provides an architecture to index models [11], typically for private projects, and it can be queried with OCL. This means that the query result will only contain models that satisfy the OCL expression. In contrast, the main use case of MAR is inexact matching, which means that the resulting models do not necessarily match the complete query.

8.3 Mining and analysis of model repositories

There are some works devoted to crawling and analysing models. ModelMine [50] is a tool to mine models from GitHub, along with additional information such as versioning. It is similar to our approach for crawling GitHub, except that our system uses file sizes instead of dates for filtering results. Another difference is that it does not provide specific capabilities for validating the retrieved models. Similarly, Restmule is a tool to help in the creation of REST APIs to mine data from public repositories [52]. We plan to use it and compare its performance and reliability against our system.
The work by Barriga et al. proposes a tool-chain to analyse repositories of Ecore models [12]. To prove its usefulness, they analyse a repository of 2420 Ecore models. Their approach is similar to the analysis module of MAR. However,

Table 8 Summary of search engine for models approaches, adapted and extended from [14]

  Approach                       Query format          Search type       Models      Index   Crawler   Megamodel aware
  [42] MOOGLE                    Text                  Text search       Any         Yes     No        No
  [11] HAWK                      OCL-like language     Exact results     Any         Yes     No        Partial
  [15] Text version              Text search           Text search       WebML       Yes     No        No
  [15] Structure-based version   Query-by-example      Structure based   WebML       No      No        No
  [31]                           Query-by-example      Structure based   UML         Yes     No        No
  [38] MoScript                  OCL-like language     Exact results     Any         No      No        Yes
  [14]                           Text search           Text search       Any         Yes     No        Yes
  [28]                           Query-by-example      Structure based   BPMN        No      No        No
  [57]                           Query-by-example      Structure based   BPMN        Yes     No        No
  [18]                           Query-by-example      Structure based   Petri nets  No      No        No
  MAR                            Query-by-example      Structure based   Any         Yes     Yes       No

our validators can handle models of different types (not only Ecore meta-models), they are fault-tolerant and we have used them to analyse an order of magnitude more models.
A statistical analysis is carried out on repositories of ∼450 Ecore models by Di Rocco et al. [25]. In particular, the authors studied relations between structural features of the meta-models (for instance, the correlation between featureless meta-classes and the number of meta-classes with a supertype) and their distributions (for instance, the distribution of isolated classes). A similar analysis is carried out by the same authors in [27] in the context of ATL transformations together with their source and target meta-models.
SAMOS is used to analyse and cluster the S.P.L.O.T. repository [9]. First, they try to get a high-level overview of the repository by considering large clusters. Then, they carry out a more fine-grained clustering to discover model clones or duplicates.

8.4 Search engines

A number of search engines for models have been proposed. However, as far as we know, none of them is widely used. In Table 8, the proposed search engines are shown along with an analysis of their features.
A search engine for WebML models is presented by Bislimovska in [15]. It supports two types of queries: text queries and query-by-example. This search engine uses an inverted index, so the response to text queries is fast. On the other hand, when an example is given as a query, the similarity between models is computed by using a variation of the A∗ algorithm. Therefore, the query response is slow since an iteration over the full repository is needed to obtain the final ranked list. On the other hand, the work by Gomes [31] focuses on the retrieval of UML models using the combination of WordNet and Case-Based Reasoning.
MOOGLE [42] is a generic search engine that performs text search, in which the user can specify the type of the desired model element to be returned. MOOGLE uses the Apache Lucene query syntax and Apache SOLR as the backend search engine. In [14], a similar approach to MOOGLE is proposed with some new features like megamodel awareness (consideration of relations among different kinds of artefacts).
MoScript is proposed in [38], a DSL for querying and manipulating model repositories. This approach is model-independent, but the user has to type complex OCL-like queries that only retrieve exact models, not a ranked list of models.
The work by Dijkman [28] investigates the application of graph matching algorithms to ranking BPMN models given an example as the query. After that, in [57], the greedy algorithm studied in [28] is improved, obtaining a procedure that computes the similarity between BPMN graphs ten times faster. On the other hand, Cao [18] proposes to use the Hungarian algorithm to query similar Petri nets.
In the work by Kalnina [34], a tool for finding model fragments in a DSL tool is proposed. The fragments are expressed using concrete syntax and a graph matching procedure is used as the search mechanism.
Genetic algorithms with structural and syntactic metrics were used to compare two meta-models in the work by Kessentini [37]. However, this approach is not intended to be fast since it uses genetic algorithms.

8.5 Model clone detection

It is worth including this topic in the related work section because it can be seen as a type of approximate matching. Model clone detection is a relatively new topic [10] and much research has been devoted to dealing with challenges such as scalability (models can be large graphs [22]),

tool-specific representations, internal identifiers, and abstract versus concrete syntaxes [10,54]. However, model clone detection approaches have not been applied to search in large-scale repositories, since these techniques are more focussed on pairwise model matching. In our case, we have a scoring algorithm that is global, in the sense that model paths are stored together in the index. This makes our approach more scalable in terms of the size of the repository.
In [10], SAMOS is applied to meta-model clone detection. A study of UML clones is carried out in [54] and [55], where a tool to detect clones (called MQlone) is presented and a description of clone types in UML is provided.
Exas encodes a graph model into a vector to perform approximate matching [44,48]. This technique has also been applied to detect model transformation clones [56].
On the other hand, clone detectors for Simulink models have been studied in the literature. For instance, an approach for clone detection in data flow languages is presented in [23], which is based on graph theory and is applied to model clone detection in Simulink. SIMONE (an extension of the text-based tool NICAD) is presented in [5] to detect clones in Simulink models.

9 Conclusions

In this paper, we have presented MAR, a search engine specifically designed for models. We have explained its crawler and analysis architecture, with which we have processed 600,000 models so far. We have implemented indexers for both keyword-based and example-based queries. For the latter, we report on the indexer design, which includes both batch and incremental modes. We have evaluated the performance of MAR in terms of indexing and search time, obtaining good results. Moreover, we have evaluated the search precision using mutation operators to simulate user queries. At the user level, MAR provides a web interface to perform queries and inspect the results.
Regarding its practical usage, MAR is able to crawl models from several sources and it has currently indexed more than 500,000 unique models of different kinds, including Ecore meta-models, BPMN diagrams and UML models. This makes it a unique system in the MDE ecosystem, which we hope is useful for the community. MAR is available at [Link]
As future work, we plan to keep discovering new model sources and indexing more models. We want to include feedback mechanisms for users to rate the results of the queries and use this feedback to improve the system. Moreover, we would like to address some limitations on the expressiveness of the queries. For instance, it would be interesting to allow encoding incomplete models (e.g. a UML class whose name is a wildcard) and to consider synonyms (e.g. a search for employee also returns models related to workers). Another line of work is the application of machine learning techniques that may help us automate some tasks like generating descriptive tags, automatically classifying models in order to improve faceted search, and considering quality features in the ranking algorithm by using a learning-to-rank approach [17,32].

Acknowledgements This work was funded by a Ramón y Cajal 2017 Grant (RYC-2017-237), funded by MINECO (Spain) and co-funded by the European Social Fund. José Antonio Hernández López enjoys a FPU Grant funded by the Universidad de Murcia.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit [Link]/licenses/by/4.0/.

References

1. Apache HBase. [Link]
2. Apache Lucene. [Link]
3. GenMyModel. [Link]
4. Massif: Matlab Simulink integration framework for Eclipse. [Link]/viatra/massif
5. Alalfi, M.H., Cordy, J.R., Dean, T.R., Stephan, M., Stevenson, A.: Models are code too: near-miss clone detection for Simulink models. In: 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 295–304. IEEE (2012)
6. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3(POPL), 1–29 (2019)
7. Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S.: Searching the web. ACM Trans. Internet Technol. 1(1), 2–43 (2001)
8. Babur, Ö., Cleophas, L.: Using n-grams for the automated clustering of structural models. In: International Conference on Current Trends in Theory and Practice of Informatics, pp. 510–524. Springer (2017)
9. Babur, Ö., Cleophas, L., van den Brand, M.: Model analytics for feature models: case studies for SPLOT repository. In: MODELS Workshops, pp. 787–792 (2018)
10. Babur, Ö., Cleophas, L., van den Brand, M.: Metamodel clone detection with SAMOS. J. Comput. Lang. (2019)
11. Barmpis, K., Kolovos, D.: Hawk: towards a scalable model indexing architecture. In: Proceedings of the Workshop on Scalability in Model Driven Engineering, pp. 1–9 (2013)
12. Barriga, A., Di Ruscio, D., Iovino, L., Nguyen, P.T., Pierantonio, A.: An extensible tool-chain for analyzing datasets of metamodels. In: Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, pp. 1–8 (2020)
13. Basciani, F., Di Rocco, J., Di Ruscio, D., Di Salle, A., Iovino, L., Pierantonio, A.: MDEForge: an extensible web-based modeling platform. In: CloudMDE@MoDELS, pp. 66–75 (2014)
14. Basciani, F., Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Exploring model repositories by means of megamodel-aware search operators. In: MODELS Workshops, pp. 793–798 (2018)
15. Bislimovska, B., Bozzon, A., Brambilla, M., Fraternali, P.: Textual and content-based search in repositories of web application models. ACM Trans. Web (TWEB) 8(2), 1–47 (2014)
16. Bucchiarone, A., Cabot, J., Paige, R.F., Pierantonio, A.: Grand challenges in model-driven engineering: an analysis of the state of the research. Softw. Syst. Model. 1–9 (2020)
17. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96 (2005)
18. Cao, B., Wang, J., Fan, J., Yin, J., Dong, T.: Querying similar process models based on the Hungarian algorithm. IEEE Trans. Serv. Comput. 10(1), 121–135 (2016)
19. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. (TOCS) 26(2), 1–26 (2008)
20. Chowdhury, S.A., Varghese, L.S., Mohian, S., Johnson, T.T., Csallner, C.: A curated corpus of Simulink models for model-based empirical studies. In: 2018 IEEE/ACM 4th International Workshop on Software Engineering for Smart Cyber-Physical Systems (SEsCPS), pp. 45–48. IEEE (2018)
21. Clarisó, R., Cabot, J.: Applying graph kernels to model-driven engineering problems. In: Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis, pp. 1–5 (2018)
22. Deissenboeck, F., Hummel, B., Juergens, E., Pfaehler, M., Schaetz, B.: Model clone detection in practice. In: Proceedings of the 4th International Workshop on Software Clones, pp. 57–64 (2010)
23. Deissenboeck, F., Hummel, B., Jürgens, E., Schätz, B., Wagner, S., Girard, J.F., Teuchert, S.: Clone detection in automotive model-based development. In: 2008 ACM/IEEE 30th International Conference on Software Engineering, pp. 603–612. IEEE (2008)
24. Di Rocco, J., Di Ruscio, D., Härtel, J., Iovino, L., Lämmel, R., Pierantonio, A.: Understanding MDE projects: megamodels to the rescue for architecture recovery. Softw. Syst. Model. 19(2), 401–423 (2020)
25. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Mining metrics for understanding metamodel characteristics. In: Proceedings of the 6th International Workshop on Modeling in Software Engineering, pp. 55–60 (2014)
26. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Collaborative repositories in model-driven engineering [software technology]. IEEE Softw. 32(3), 28–34 (2015)
27. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Mining correlations of ATL model transformation and metamodel metrics. In: 2015 IEEE/ACM 7th International Workshop on Modeling in Software Engineering, pp. 54–59. IEEE (2015)
28. Dijkman, R., Dumas, M., García-Bañuelos, L.: Graph matching algorithms for business process model similarity search. In: International Conference on Business Process Management, pp. 48–63. Springer (2009)
29. France, R., Bieman, J., Cheng, B.H.: Repository for model driven development (ReMoDD). In: International Conference on Model Driven Engineering Languages and Systems, pp. 311–317. Springer (2006)
30. George, L.: HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O'Reilly Media, Inc. (2011)
31. Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carreiro, P., Ferreira, J.L., Bento, C.: Using WordNet for case-based retrieval of UML models. AI Commun. 17(1), 13–23 (2004)
32. He, C., Wang, C., Zhong, Y.X., Li, R.F.: A survey on learning to rank. In: 2008 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1734–1739. IEEE (2008)
33. Holmes, R., Walker, R.J.: Systematizing pragmatic software reuse. ACM Trans. Softw. Eng. Methodol. (TOSEM) 21(4), 1–44 (2013)
34. Kalnina, E., Sostaks, A.: Towards concrete syntax based find for graphical domain specific languages. In: 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pp. 236–242. IEEE (2019)
35. Karasneh, B., Chaudron, M.R.: Online Img2UML repository: an online repository for UML models. In: EESSMOD@MoDELS, pp. 61–66. Citeseer (2013)
36. Karasneh, B., Jolak, R., Chaudron, M.R.: Using examples for teaching software design: an experiment using a repository of UML class diagrams. In: 2015 Asia-Pacific Software Engineering Conference (APSEC), pp. 261–268. IEEE (2015)
37. Kessentini, M., Ouni, A., Langer, P., Wimmer, M., Bechikh, S.: Search-based metamodel matching with structural and syntactic measures. J. Syst. Softw. 97, 1–14 (2014)
38. Kling, W., Jouault, F., Wagelaar, D., Brambilla, M., Cabot, J.: MoScript: a DSL for querying and manipulating model repositories. In: International Conference on Software Language Engineering, pp. 180–200. Springer (2011)
39. Kolovos, D., De La Vega, A., Cooper, J.: Efficient generation of graphical model views via lazy model-to-text transformation. In: Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, pp. 12–23 (2020)
40. López, J.A.H., Cuadrado, J.S.: MAR: a structure-based search engine for models. In: Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, pp. 57–67 (2020)
41. López-Fernández, J.J., Guerra, E., De Lara, J.: Assessing the quality of meta-models. In: MoDeVVa@MoDELS, pp. 3–12. Citeseer (2014)
42. Lucrédio, D., Fortes, R.P., Whittle, J.: Moogle: a model search engine. In: International Conference on Model Driven Engineering Languages and Systems, pp. 296–310. Springer (2008)
43. Lucrédio, D., Fortes, R.P., Whittle, J.: MOOGLE: a metamodel-based model search engine. Softw. Syst. Model. 11(2), 183–208 (2012)
44. Nguyen, H.A., Nguyen, T.T., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N.: Accurate and efficient structural characteristic feature extraction for clone detection. In: International Conference on Fundamental Approaches to Software Engineering, pp. 440–455. Springer (2009)
45. Nguyen, P.T., Di Rocco, J., Di Ruscio, D., Pierantonio, A., Iovino, L.: Automated classification of metamodel repositories: a machine learning approach. In: 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 272–282. IEEE (2019)
46. Nguyen, P.T., Di Ruscio, D., Pierantonio, A., Di Rocco, J., Iovino, L.: Convolutional neural networks for enhanced classification mechanisms of metamodels. J. Syst. Softw. 172, 110860 (2021)
47. Nikolentzos, G., Siglidis, G., Vazirgiannis, M.: Graph kernels: a survey. arXiv preprint arXiv:1904.12218 (2019)
48. Pham, N.H., Nguyen, H.A., Nguyen, T.T., Al-Kofahi, J.M., Nguyen, T.N.: Complete and accurate clone detection in graph-based models. In: 2009 IEEE 31st International Conference on Software Engineering, pp. 276–286. IEEE (2009)
49. Porter, M.F.: An algorithm for suffix stripping. Program (1980)
50. Reza, S.M., Badreddin, O., Rahad, K.: ModelMine: a tool to facilitate mining models from open source repositories. In: Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, pp. 1–5 (2020)
51. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc, London (2009)
52. Sanchez, B.A., Barmpis, K., Neubauer, P., Paige, R.F., Kolovos, D.S.: Restmule: enabling resilient clients for remote APIs. In: Proceedings of the 15th International Conference on Mining Software Repositories, pp. 537–541 (2018)
53. Stephan, M.: Towards a cognizant virtual software modeling assistant using model clones. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 21–24. IEEE (2019)
54. Störrle, H.: Towards clone detection in UML domain models. Softw. Syst. Model. 12(2), 307–329 (2013)
55. Störrle, H.: Effective and efficient model clone detection. In: Software, Services, and Systems, pp. 440–457. Springer (2015)
56. Strüber, D., Acreţoaie, V., Plöger, J.: Model clone detection for rule-based model transformation languages. Softw. Syst. Model. 18(2), 995–1016 (2019)
57. Yan, Z., Dijkman, R., Grefen, P.: Fast business process similarity search. Distrib. Parallel Databases 30(2), 105–144 (2012)
58. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). [Link]
59. Zhai, C., Massung, S.: Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining (2016)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

José Antonio Hernández López is a predoctoral researcher at the University of Murcia. He obtained a [Link]. in Mathematics, a [Link]. in Computer Science and a [Link]. in Big Data from the University of Murcia. Currently, he is enrolled in a Ph.D. programme in Computer Science supervised by Jesús Sánchez Cuadrado. His research interests include model-driven engineering (MDE), recommender systems and machine learning for software engineering.

Jesús Sánchez Cuadrado is a Ramón y Cajal researcher at the Languages and Systems Department of the University of Murcia. His research is focused on model-driven engineering (MDE) topics, notably model transformation languages, meta-modelling and domain-specific languages. Lately, he has been involved in the construction of the MAR search engine ([Link] On these topics, he has published several articles in journals and peer-reviewed conferences, and developed several open-source tools. His e-mail address is jesusc@[Link] and his web-page is [Link]