
Expert Systems With Applications 63 (2016) 20–34

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

An ontology-based data integration approach for web analytics in e-commerce
María del Mar Roldán García, José García-Nieto∗, José F. Aldana-Montes1
Dept. de Lenguajes y Ciencias de la Computación, University of Málaga, ETSI Informática, Campus de Teatinos, Málaga - 29071, Spain

Article info

Article history:
Received 15 December 2015
Revised 15 June 2016
Accepted 16 June 2016
Available online 23 June 2016

Keywords:
Semantic model
Ontology
E-commerce
Web analytics

Abstract

Web analytics has emerged as one of the most important activities in e-commerce, since it allows companies and e-merchants to track the behavior of customers when visiting their web sites. There exists a series of tools for web analytics that are used not only for tracking and measuring web traffic, but also for analyzing commercial activity. However, most of these tools focus on low-level web attributes and metrics, making other sophisticated functionalities and analyses available only in commercial (non-free) versions.

In this context, the SME-Ecompass European initiative aims at providing e-commerce SMEs with accessible tools for high-level web analytics. These software facilities should use different sources of data coming from digital footprints allocated in e-shops, fuse them together in a coherent way, and make them available for advanced data mining procedures. This motivated us to propose in this work an ontology-based approach to collect, integrate and store web analytics data from many sources of popular and commercial digital footprints. As the article's main impact, we obtain enriched and semantically annotated data that is used to properly train an intelligent system, involving data mining procedures, for the analysis of customer behavior in real e-commerce sites. Specifically, for the validation of our semantic approach, we have captured and integrated data from Google Analytics and Piwik digital footprints allocated in 15 e-shops of different commercial sectors and countries (UK, Spain, Greece and Germany), throughout several months of activity. The obtained results show different perspectives in customer behavior analysis that go one step beyond the most popular web analytics tools in the current market.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

In the last few years, web analytics has emerged as one of the most important activities in e-commerce, since it allows companies and e-merchants to track the behavior of customers when visiting their e-shop sites. Web analytics applications can also help companies to measure the results of traditional print or broadcast advertising campaigns. The web analytics procedure is based on measuring a visitor's behavior once on a given e-shop site, which includes its drivers and conversions (to actual customer). These data are typically compared against key performance indicators and used to improve a website or marketing campaign's audience response.

In the current market, there exists a series of tools for web analytics, such as Google Analytics, Piwik, Clicky, and StatCounter, that are widely used not only for tracking and measuring web traffic, but also for analyzing commercial activity, and hence for improving the effectiveness of a website. However, these tools often focus on low-level and limited sets of web metrics and attributes, without the possibility of providing specialized analyses. In most cases, high-level web metrics and sophisticated functionalities are available only in commercial (non-free) versions, which are rarely accessible to SMEs or individual e-merchants.

In this context, the SME-Ecompass European initiative² aims at providing e-commerce SMEs with accessible tools for high-level web analytics. These software facilities use the different sources of data coming from different digital footprints allocated in e-shops. However, integrating data from multiple heterogeneous sources entails dealing with different data models, schemas and query languages. Therefore, there is a clear demand for integrative procedures that provide the advanced data mining algorithms with uniform access to multiple heterogeneous web data sources.

The main hypothesis in this work is: (H1) an ontology-based integration approach will help us to collect web analytics data from many sources of popular and commercial digital footprints, fuse the data together in a coherent way, and store them. As a result, (H2) we will obtain enriched and semantically annotated data that can be used to train data mining procedures for advanced analysis of customer behavior in real e-commerce sites.

This motivated us to propose a semantic approach that uses an ontology as a mediated schema for the representation and consolidation, in a knowledge base, of the tracking data together with the semantics of their web sources. Semantic web ontologies have become a key technology for intelligent knowledge processing, providing a framework for sharing conceptual models about a domain. Semantic mappings between the source schema and the ontology are then defined and used to transform the original data to RDF (Resource Description Framework)³. This way, data from heterogeneous sources are stored and integrated inside a single RDF repository, which can then be easily queried by high-level algorithms. The goal is to properly feed artificial intelligence procedures capable of deciding how to perform marketing activities, such as displaying a given advertisement targeted to a certain category of clients, or decreasing the price of a product in a given region, thus giving rise to sophisticated expert systems for e-commerce applications.

The main contributions of this study are summarized as follows:

– We have developed a semantic approach for the data integration and consolidation of multiple web analytics data sources. These data are accumulated daily from many heterogeneous digital footprints allocated on actual e-shops.
– We have designed and implemented, for the first time, an OWL (Web Ontology Language) ontology (Dean & Schreiber, 2004) for web analytics. This ontology considers a large and complementary set of attributes and metrics, which have been taken from several representative web analytics tools in the market.
– To test hypothesis H1, we have captured and integrated data from Google Analytics and Piwik digital footprints allocated in 15 e-shops of different commercial sectors (retail, tourism, electronics, pharmacy, etc.) and countries (UK, Spain, Greece and Germany), throughout several months of activity. The data are integrated following the same (standard) format and stored in a common RDF repository.
– To test hypothesis H2, the obtained "semantized" data are used to train advanced data mining algorithms to perform customer profile analyses. In particular, these algorithms are tested with success in two case studies to classify visitor behavior and product preference.

The remainder of this article is organized as follows. In Section 2, background and literature overview are presented. Section 3 presents the current state and practices in web analytics for e-commerce. In Section 4, the semantic approach is described, giving details of the service architecture and the OWL ontology. After this, the validation procedure is reported in Section 5. Finally, main conclusions and future work are given in Section 6.

∗ Corresponding author.
E-mail addresses: [email protected] (M.d.M.R. García), [email protected], [email protected] (J. García-Nieto), [email protected] (J.F. Aldana-Montes).
¹ This work is partially funded by FP7 EU project SME E-COMPASS under Grant No: 315637. It is also partially funded by Grants TIN2014-58304 (Spanish Ministry of Sciences and Innovation) and Regional projects P11-TIC-7529/P12-TIC-1519. The authors thank the involved e-shops for kindly offering web tracking data for testing and validation.
² SME-Ecompass FP7 European initiative http://www.sme-ecompass.eu/
³ RDF in W3C https://www.w3.org/RDF/

http://dx.doi.org/10.1016/j.eswa.2016.06.034
0957-4174/© 2016 Elsevier Ltd. All rights reserved.

2. Background and related work

This section describes the main background concepts. A review of current related works in the specialized literature is carried out to point out their main differences with regard to our approach.

2.1. Background concepts

- Ontology. Ontologies provide a formal representation of the real world, shared by a sufficient amount of users, by defining concepts and relationships between them (Gruber, 1993). In computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. These primitives are typically concepts (classes), attributes (properties), class members (class instances) and relationships (property instances). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.

Ontologies are part of the W3C standards stack for the semantic web, in which they are used to specify standard conceptual vocabularies in which to exchange data between systems, provide services for answering queries, publish reusable knowledge bases, and offer services to facilitate interoperability across multiple, heterogeneous systems and databases.

- RDF. The Resource Description Framework is a basic ontology language used for representing information about resources on the web (Staab & Studer, 2009). Resources are described in terms of properties and property values using RDF statements. Statements are represented as triples, consisting of a subject, predicate and object. RDF Schema (RDFS) (Staab & Studer, 2009) "semantically extends" RDF to enable us to talk about classes of resources, and the properties that will be used with them. It does this by giving particular meanings to certain RDF properties and resources. RDFS provides the means to describe application-specific RDF vocabularies. RDF and RDFS provide basic capabilities for describing vocabularies that describe resources, metadata and ontologies.

- SPARQL. It is an RDF query language for ontology models and databases, capable of extracting and manipulating information stored in RDF format. Essentially, SPARQL is a graph-matching query language that can be used to extract knowledge from a model such as the one proposed in this article. Given a data source D, a query consists of a pattern, which is matched against D. The combinations of values resulting from this matching constitute the result of the query (Pérez, Arenas, & Gutierrez, 2009). SPARQL has strong support for querying semi-structured and tagged data, e.g. data with an unpredictable and unreliable structure. SPARQL supports queries to networked web data sources identified by URIs. In fact, it is a W3C recommendation for RDF data.

- OWL. In 2004, the W3C ontology working group (Dean & Schreiber, 2004) proposed OWL as a semantic markup language for publishing and sharing ontologies on the World Wide Web. From a formal point of view, OWL is equivalent to a very expressive description logic where an ontology corresponds to a Tbox (Gruber, 1993). This equivalence allows the language to exploit description logic research results. OWL extends RDF and RDFS. When compared to RDF models, OWL adds more vocabulary for describing properties and classes: relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes (McGuinness & Harmelen, 2004).

- OWL-DL. A syntactic variant of the SHOIN(D) description logic (Haase & Stojanovic, 2005) with a different terminology to OWL, which is based on RDFS, hence the support for data values, data types and data type properties. OWL-DL restricts OWL in two distinct ways (Horrocks & Patel-Schneider, 2003): first, some syntactic constructs, such as recursive descriptions, are not allowed; second, classes, individuals and properties (respectively concepts, individuals and roles in description logics) must all be disjoint. In this work, we use OWL-DL syntax to formalize the proposed ontology for our semantic model. A summarized description of basic OWL-DL semantic syntax is shown in Table 1, where an informal logic syntax is represented (left
column) with regard to the corresponding OWL-DL equivalent (right).

Table 1
Basic OWL-DL semantic syntax used to formally define the proposed ontology.

Descriptions   Abstract syntax                  DL syntax
Operators      intersection(C1, C2, …, Cn)      C1 ⊓ C2 ⊓ … ⊓ Cn
               union(C1, C2, …, Cn)             C1 ⊔ C2 ⊔ … ⊔ Cn
Restrictions   for at least 1 value V from C    ∃V.C
               for all values V from C          ∀V.C
               R is symmetric                   R ≡ R⁻
Class axioms   A partial(C1, C2, …, Cn)         A ⊑ C1 ⊓ C2 ⊓ … ⊓ Cn
               A complete(C1, C2, …, Cn)        A ≡ C1 ⊓ C2 ⊓ … ⊓ Cn

2.2. Related works

In the last decade, a series of studies have appeared that semantically model certain domains or sub-domains of knowledge in the context of e-commerce.

A first attempt was proposed in Trastour, Bartolini, and Preist (2003), where a service description language was defined to be used throughout the life-cycle of a business-to-business (B2B) e-commerce interaction. In particular, they focused on DAML+OIL, as it is a sufficiently expressive and flexible service description language to be used not only in advertisements, but also in matchmaking queries, negotiation proposals and agreements.

After this, Tamma et al. (Tamma, Phelps, Dickinson, & Wooldridge, 2005) designed an approach to negotiation activities in e-commerce sites. In this work, the negotiation protocol does not need to be hard-coded in agents, but it is represented by an ontology, in terms of an explicit and declarative representation of the negotiation protocol. The ontology is also used to tune agents' strategies to the specific protocol used.

A special case of e-commerce sites is e-tourism, for which Waralak (Waralakv, 2008) discussed some ontological trends that support the growing domain of online tourism. This study also gave some example concepts of existing e-tourism ontologies, displayed in a graphical model, and showed their descriptions in OWL and RDFS syntax.

Hepp defined two related ontologies (Hepp, 2008): GoodRelations and Product Ontology. GoodRelations is a standardized vocabulary for e-commerce (product, price, store, and company data) and the Product Ontology is an ontology for describing product types based on Wikipedia.

More recently, Gatchalee et al. (Gatchalee, Li, & Supnithi, 2013) proposed an ontology approach to cover the knowledge about the content and architecture of SMEs' e-commerce websites. This knowledge is then used as input to a recommendation system for web design, centered on the structure of e-shops in Thailand as sample group.

Finally, Akanbi (Akanbi, 2014) proposed LB2CO, an ontology which combines the framework of IDEF5 & SNAP as an analysis tool, for automated recommendation of products and services. This ontology is used to model a semantic framework for B2C transactions across different business domains that facilitates the interoperability and integration of transactions over the web.

As summarized in Table 2, all these works proposed semantic models focusing on different aspects in the domain of e-commerce, such as web contents, structure, and life-cycle activities. However, to the best of our knowledge, there is still a lack of works where a semantic model is used to consider web analytics attributes and metrics from multiple and heterogeneous sources of data. This is a critical issue for current e-merchants that we try to address, for the first time, with our approach.

Table 2
Related approaches in the state of the art. The target area of application, the ontology/vocabulary used, the post-processing analysis, and the validation procedure are reported for each work.

Approach                 Target area                  Ontology/vocabulary        Analysis          Validation
Trastour et al. (2003)   Life-cycle in B2B            DAML+OIL                   -                 No
Tamma et al. (2005)      Negotiation in e-commerce    Negotiation protocol       Agents system     No
Waralakv (2008)          E-tourism                    E-tourism concepts         Recommender       No
Hepp (2008)              Products in e-commerce       GoodRelations              -                 Academic
Gatchalee et al. (2013)  E-commerce sites design      Website design structure   Recommender       Academic
Akanbi (2014)            B2C transactions             LB2CO                      Recommender       Academic
Proposal                 Web analytics digital        Web analytics              Data mining       Real
                         footprints (Google/Piwik)    Ontology (WAO)             Visitor/product   world

In contrast to other past proposals, we validate our semantic model with real data coming from digital footprints and web scraping methods in a number of actual e-shops. As a result, some of these e-commerce SMEs have put into practice the generated data and analyses, leading them to change and improve their commercial strategies.

3. E-commerce web analytics: current practices

Before describing our semantic approach, we summarize in this section a series of activities we carried out in the context of the SME-Ecompass project, with the aim of shedding light on the actual state of e-commerce companies.

In a first phase, we delivered an online survey to e-shop owners (from chambers associated with the SME-Ecompass project) with different questions regarding their company's profile, commerce activity, current practices when analyzing customers' and competitors' behaviors, etc. After that, face-to-face interviews were also conducted with a selection of e-merchants in order to obtain detailed information on their professional experience in e-commerce. A complete report of these questionnaires, with the analysis of responses, statistics and conclusions, can be found in García-Nieto and Roldán (2014). Specifically, more than 150 e-commerce SMEs completed the online surveys and 20 of them were interviewed in private sessions. The following conclusions were extracted:

– Most of the studied companies are micro enterprises with 1 to 5 employees that work in the Business to Consumer (B2C) retailing sector. Most of them have a maximum of 5000 orders per year (2013/2014) and a maximum annual revenue of 10,000 euros from online sales. Therefore, they are target candidates to benefit from automatic (free or inexpensive) tools to obtain advanced e-commerce analysis.
– Concerning visitor behavior analyses (see Fig. 1, left), 47% of companies do not use any tool for this kind of task. In contrast, 29% declared that they use automatic online tools and 22% perform these tasks manually. Most of them (> 80%) declared to be quite interested in using a service to discover tendencies and common habits in clients.
– Interestingly, as shown in Fig. 1 (right), Google Analytics stands out as the most used tool among the interviewed companies (68%), although they also use additional tools like Piwik (16%) and others (16%). Therefore, the set of common metrics e-merchants usually analyze are those computed by Google Analytics, e.g., number of visits, average visit time, geo-localization, country, client device, etc. These metrics are usually checked weekly or monthly.

Fig. 1. Current practices in surveyed e-commerce SMEs with regard to the use of automatic tools for web analytics.

In the light of these results, we decided to focus our semantic model on attributes and metrics provided by Google Analytics and Piwik. We selected the former for being the most used analytics tool in the market. However, the advanced e-commerce functionality of Google Analytics is (to date) not available for free users. This issue led us to complement our set of attributes with those of Piwik, since this tool offers free access to advanced e-commerce attributes. Additional metrics for competitor and price monitoring are also considered in this work, which are gathered from specific web scraping processes of the SME E-Compass tool. It is worth noting that the proposed ontology is aimed at covering web analytics attributes that are as general as possible, hence enabling the incorporation of new analytics tools in our semantic approach.

4. Semantic approach

One of the main aims of this work is to capture, clean, consolidate and integrate data from different sources of web tracking in e-commerce sites. For this reason, we opted to design a semantic approach for sharing and reconciliation, whereby an agreed ontology model is used to achieve a common understanding of the domain in which the system operates. Specifically, we have developed an OWL ontology to describe the main features of e-shops by following the standard ontology 101 development process (Natalya, McGuinness, & Deborah, 2001) of seven steps:

(i) Determine the domain and scope of the ontology. As the starting point, to limit the scope of the ontology, we selected the kind of variables that the data mining algorithms need from Google Analytics and Piwik and also from competitors' e-shops, for instance: visitors' origin, visitors' attributes, purchasing behavior, product and customer details, etc.
(ii) Consider reusing existing ontologies. As we examined in Section 2.2, no similar ontologies have been previously proposed for modeling web tracking data in e-commerce. However, we partially considered two related ontologies: GoodRelations (Hepp, 2008), which is a standardized vocabulary for e-commerce, and the Product Ontology (Hepp, 2008), which contextualizes product types based on Wikipedia.
(iii) Enumerate important terms in the ontology. Important terms in the ontology were extracted in a previous phase of requirements specification (García-Nieto & Roldán, 2014) from the minimum set of variables that are needed. Examples of such terms are: address, visitor, customer, device, browser, Geographical_origin, Number_of_visitors, Conversion_rate, etc.
(iv) Define classes and the class hierarchy. From the list of terms, we obtained the ontology classes. Fig. 2 shows the main set of classes in the hierarchy starting from the top class Thing (⊤). These main classes are related to other classes and some of them have subclasses. For instance, Analytic_parameters has a series of subclasses, such as: Bounce_rate, Total_revenue, Number_of_returning_visitors, and Number_of_transactions.
(v) Define the properties of classes and slots. In order to relate classes and to define attributes, we identified object and data type properties based on the minimum set of variables previously defined. Examples of object properties are: an e-shop owner is owner of an e-shop, a visitor makes visits, a device has a browser, an IP address belongs to an organization, etc. Examples of data type properties are the title and URL of a page, the first and last name of an e-shop owner, the version of the operating system, the duration of a visit, etc. An object property is defined for each subclass to establish the correct relationship. For example, Page is related to Bounce_rate and Date_of_last_visit; E-shop is related to Number_of_customers. Tables 3–8 describe, in OWL-DL, representative subsets of object and data properties of a selection of the main classes.
(vi) Define the facets of the slots. This step includes the definition of cardinality constraints and value restrictions. Value restrictions are used in our ontology to specify the data type for the value in each subclass of the Analytic_parameters class. For example, the range of the property hasValue is restricted to float when the class Bounce_rate is its domain; the range of the property hasValue is restricted to date when the class Date_of_last_visit is its domain.
(vii) Create instances. Instances (individuals in OWL) correspond to the specific data obtained from a specific e-shop. Individuals will be obtained by mapping the data from Google Analytics, Piwik or competitors' e-shops to RDF according to the ontology. Individuals can also be used in the ontology to define the exact members of a class. The range of the property hasType is restricted to the values "ASIN", "EAN" or "ISBN" when its domain is Article_number. ASIN, EAN and ISBN are then ontology individuals (see Table A.5 for further explanations).

4.1. Ontology model

The proposed ontology, called "wao.owl" (Web Analytics Ontology), resulting from the development process described above, has a total of 62 classes (groups of individuals sharing the same attributes), 61 object properties (binary relationships between individuals), 67 data properties (individual attributes), 33 restriction axioms and 3 individuals. The complete ontology is available in the WebProtégé repository.⁴

⁴ URL link http://stanford.io/1XHhHzr
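
Steps (vi) and (vii) of the development process can be made concrete with a small sketch. The following Python fragment is an illustration only, not the authors' implementation: it maps one analytic-parameter reading and one article number to subject-predicate-object triples and enforces the two kinds of restriction mentioned above (the data type of hasValue and the enumerated values of hasType). The namespace IRI, the function names and the sample values are assumptions introduced here; only the class and property names come from the text.

```python
from datetime import date

# Hypothetical namespace: the paper does not state the IRI of wao.owl.
WAO = "http://example.org/wao#"

# Step (vi): expected Python type of hasValue for two Analytic_parameters subclasses.
VALUE_TYPES = {"Bounce_rate": float, "Date_of_last_visit": date}

# Step (vii): the three individuals allowed as hasType of an Article_number.
ARTICLE_NUMBER_TYPES = {"ASIN", "EAN", "ISBN"}


def parameter_triples(individual, param_class, value, obtained_on):
    """Map one analytic-parameter reading to (subject, predicate, object) triples."""
    expected = VALUE_TYPES.get(param_class)
    if expected is not None and not isinstance(value, expected):
        raise TypeError(f"hasValue of {param_class} must be {expected.__name__}")
    subject = WAO + individual
    return [
        (subject, "rdf:type", WAO + param_class),
        (subject, WAO + "hasValue", value),
        (subject, WAO + "hasDate", obtained_on),
    ]


def article_number_triples(individual, code, code_type):
    """Describe an article number, rejecting types outside the enumerated set."""
    if code_type not in ARTICLE_NUMBER_TYPES:
        raise ValueError(f"hasType must be one of {sorted(ARTICLE_NUMBER_TYPES)}")
    subject = WAO + individual
    return [
        (subject, "rdf:type", WAO + "Article_number"),
        (subject, WAO + "hasValue", code),
        (subject, WAO + "hasType", WAO + code_type),
    ]


# Usage with made-up readings (not data from the paper).
bounce = parameter_triples("bounce_rate_0001", "Bounce_rate", 0.42, date(2016, 1, 15))
isbn = article_number_triples("article_0001", "978-0-00-000000-2", "ISBN")
```

In a real deployment the triples would be serialized to RDF and loaded into the repository; plain tuples are used here only to keep the sketch free of third-party dependencies.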
Fig. 2. General overview of the WAO ontology. Continuous arrows refer to "subclass of" relations. Dotted arrows refer to specific properties.

Table 3
Analytics_parameters group: object and data properties.

Object properties   Description logic
hasBrowser          ∃ hasBrowser.Thing ⊑ Analytic_parameters ⊔ Device
                    ⊤ ⊑ ∀ hasBrowser.Browser
hasCity             ∃ hasCity.Thing ⊑ Analytic_parameters ⊔ Location ⊔ Visitor
                    ⊤ ⊑ ∀ hasCity.City
hasRegion           ∃ hasRegion.Thing ⊑ Analytic_parameters ⊔ Location ⊔ Visitor
                    ⊤ ⊑ ∀ hasRegion.Region
hasCountry          ∃ hasCountry.Thing ⊑ Analytic_parameters ⊔ Location ⊔ Visitor
                    ⊤ ⊑ ∀ hasCountry.Country
hasContinent        ∃ hasContinent.Thing ⊑ Analytic_parameters ⊔ Location
                    ⊤ ⊑ ∀ hasContinent.Continent
hasSource           ∃ hasSource.Thing ⊑ Analytic_parameters
                    ⊤ ⊑ ∀ hasSource.Source

Data properties     Description logic
hasDate             ∃ hasDate.DatatypeLiteral ⊑ Analytic_parameters ⊔ Price ⊔ Product_availability
                    ⊤ ⊑ ∀ hasDate.DatatypedateTimeStamp
hasHour             ∃ hasHour.DatatypeLiteral ⊑ Analytic_parameters
                    ⊤ ⊑ ∀ hasHour.Datatypetime
hasNetworkDomain    ∃ hasNetworkDomain.DatatypeLiteral ⊑ Analytic_parameters
                    ⊤ ⊑ ∀ hasNetworkDomain.Datatypestring
hasValue            ∃ hasValue.DatatypeLiteral ⊑ Analytic_parameters ⊔ Article_number ⊔ Price ⊔ Product_availability
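
Each property in Table 3 is described by a pair of axioms. Under the usual description-logic reading, the first is a domain axiom and the second a range axiom. The block below spells out that reading for a generic property R and then for hasBrowser. It is a sketch under an assumption: the connective between the domain classes is taken to be a union, which matches the intended meaning (a browser is attached to an analytic parameter or to a device) but is not stated in the text.

```latex
\begin{align*}
\exists R.\top &\sqsubseteq D_1 \sqcup \dots \sqcup D_k
  && \text{(domain: any individual with an $R$-successor belongs to some $D_i$)} \\
\top &\sqsubseteq \forall R.C
  && \text{(range: every $R$-successor belongs to $C$)} \\
\exists \mathit{hasBrowser}.\top &\sqsubseteq \mathit{Analytic\_parameters} \sqcup \mathit{Device}
  && \text{(domain of hasBrowser)} \\
\top &\sqsubseteq \forall \mathit{hasBrowser}.\mathit{Browser}
  && \text{(range of hasBrowser)}
\end{align*}
```

Read this way, asserting hasBrowser(d, b) lets a reasoner conclude that d is an Analytic_parameters or a Device individual and that b is a Browser.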

For simplicity, we describe here a representative subset of main classes including some of their most interesting object and data properties. These classes are: Analytics_parameters, E-shop, Visitor, Page, and Item. Each class requires a set of properties or conditions in order to be conceptualized. That is, an individual that satisfies those properties is considered to be a member of that class.

- Analytics_parameters. Those attributes provided by Google Analytics and Piwik that depend on time. Each analytic parameter has a value (hasValue in Table 3), which corresponds to the data provided by the analytic tool, and a date (hasDate), which corresponds to the date when the data was obtained. Subclasses in the ontology (⊑ Analytic_parameters) are, among others: Average_order_value, Average_pages_visited_per_session, Average_session_duration, Average_time_on_site, Bounce_rate, Conversion_rate, Number_of_transactions, Number_of_landings, Number_of_new_visitors, Number_of_page_views, Revenue_per_session and Total_revenue. Table 3 shows some representative object and data properties of Analytics_parameters. Each analytic parameter belongs to a data type. For instance, the value of Number_of_transactions is a non-negative integer and the value of Conversion_rate is a float. Data type restrictions are included in the ontology by means of data properties.

- E-shop. An e-shop has one or several pages and also an e-shop owner. Each e-shop owner has an address.
Table 4
E-shop group: object and data properties.

Object properties     Description logic
hasVisitor            hasVisitor ≡ makesVisit⁻
                      ∃ hasVisitor.Thing ⊑ E-shop
                      ⊤ ⊑ ∀ hasVisitor.Visitor
hasNumberOfVisitors   ∃ hasNumberOfVisitors.Thing ⊑ E-shop ⊔ Page
                      ⊤ ⊑ ∀ hasNumberOfVisitors.Number_of_visitors
hasNumberOfVisits     ∃ hasNumberOfVisits.Thing ⊑ E-shop ⊔ Page ⊔ Visitor
                      ⊤ ⊑ ∀ hasNumberOfVisits.Number_of_visits
isOwnerOf             ∃ isOwnerOf.Thing ⊑ E-shop_owner
                      ⊤ ⊑ ∀ isOwnerOf.E-shop

Data properties       Description logic
hasName               ∃ hasName.DatatypeLiteral ⊑ Browser ⊔ Competitor ⊔ E-shop ⊔ Goal ⊔ Item ⊔ Operating_system ⊔ Page ⊔ Product
                      ⊤ ⊑ ∀ hasName.Datatypestring
hasURL                ∃ hasURL.DatatypeLiteral ⊑ Competitor ⊔ E-shop ⊔ Page ⊔ Price
                      ⊤ ⊑ ∀ hasURL.Datatypestring

Table 5
Visitor group: object and data properties.

Object properties        Description logic
hasDevice                ∃ hasDevice.Thing ⊑ Visitor
                         ⊤ ⊑ ∀ hasDevice.Device
hasNumberOfVisits        ∃ hasNumberOfVisits.Thing ⊑ E-shop ⊔ Page ⊔ Visitor
                         ⊤ ⊑ ∀ hasNumberOfVisits.Number_of_visits
hasCity                  ∃ hasCity.Thing ⊑ Analytic_parameters ⊔ Location ⊔ Visitor
                         ⊤ ⊑ ∀ hasCity.City
makesVisit               hasVisitor ≡ makesVisit⁻
                         ∃ makesVisit.Thing ⊑ Visitor
                         ⊤ ⊑ ∀ makesVisit.Visit

Data properties          Description logic
hasDaysSinceFirstVisit   ∃ hasDaysSinceFirstVisit.DatatypeLiteral ⊑ Visitor
                         ⊤ ⊑ ∀ hasDaysSinceFirstVisit.DatatypenegativeInteger
hasDaysSinceLastOrder    ∃ hasDaysSinceLastOrder.DatatypeLiteral ⊑ Visitor
                         ⊤ ⊑ ∀ hasDaysSinceLastOrder.DatatypenegativeInteger
hasDaysSinceLastVisit    ∃ hasDaysSinceLastVisit.DatatypeLiteral ⊑ Visitor
                         ⊤ ⊑ ∀ hasDaysSinceLastVisit.DatatypenegativeInteger
IsReturningVisitor       ∃ IsReturningVisitor.DatatypeLiteral ⊑ Visitor
                         ⊤ ⊑ ∀ IsReturningVisitor.Datatypeboolean
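
Tables 4 and 5 relate hasVisitor and makesVisit through an equivalence axiom. Assuming that axiom is read as an inverse-property declaration (hasVisitor ≡ makesVisit⁻), the sketch below shows what such a declaration lets a reasoner derive: every statement with one property implies the mirrored statement with the other. The function name and the sample identifiers are hypothetical, and the code is a plain-Python illustration rather than an OWL reasoner.

```python
def materialize_inverse(triples, prop, inverse_prop):
    """Return the triples plus the statements implied by prop being the inverse of inverse_prop."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == prop:
            # (s, prop, o) implies (o, inverse_prop, s)
            inferred.add((obj, inverse_prop, subject))
        elif predicate == inverse_prop:
            # (s, inverse_prop, o) implies (o, prop, s)
            inferred.add((obj, prop, subject))
    return inferred


# Made-up individuals, used only to exercise the function.
facts = {("visitor_17", "makesVisit", "visit_203")}
closed = materialize_inverse(facts, "makesVisit", "hasVisitor")
```

Materializing the mirrored statements up front is one design option; a triple store with OWL inference enabled would typically derive them at query time instead.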

Table 6
Visit group: object and data properties.

Object properties  Description logic
hasNavigationStep  ∃ hasNavigationStep.Thing ⊑ Visit
  ⊤ ⊑ ∀ hasNavigationStep.Navigation_step
hasRefererKeyword  ∃ hasRefererKeyword.Thing ⊑ Visit
  ⊤ ⊑ ∀ hasRefererKeyword.Referer_keyword
makesVisit  hasVisitor ≡ makesVisit⁻
  ∃ makesVisit.Thing ⊑ Visitor
  ⊤ ⊑ ∀ makesVisit.Visit

Data properties  Description logic
hasDuration  ∃ hasDuration.Datatype Literal ⊑ Visit
  ⊤ ⊑ ∀ hasDuration.Datatype time
hasReturningVisitor  ∃ hasReturningVisitor.Datatype Literal ⊑ Visit
  ⊤ ⊑ ∀ hasReturningVisitor.Datatype boolean

Attributes for the e-shop are latitude, longitude and time zone. The e-shop's owner can have competitors, who are owners of other e-shops. The analytic parameters of an e-shop are: average_order_value, average_pages_visited_per_session, average_session_duration, average_time_on_site, conversion_rate, date_of_last_transaction, number_of_customers, number_of_failed_transactions, number_of_successful_transactions, number_of_new_customers, number_of_new_visitors, number_of_sessions_by_medium, number_of_transactions, number_of_unique_visitors, number_of_units_sold, number_of_visitors, number_of_visits, percentage_of_new_sessions, revenue_per_session, total_revenue, and number_of_returning_visitors. All the analytic parameters related to an e-shop are time dependent. Therefore, they are modeled as classes and related to the e-shop by the corresponding object property. Table 4 shows a subset of properties with classes in the e-shop group as domain.
- Visitor and visit. Class visitor has two subclasses: customer and New_visitor. A customer is a visitor who makes a purchase. If it is the first purchase of this customer, he/she is a new customer. Customers have an address and name, whereas visitors do not. A visitor visits the e-shop by using a device. The analytic parameters for visitors are: bounced_rate, number_of_visits, and number_of_visited_pages. The analytic parameter for customers is number_of_transactions. Visitors visit pages.
Visits are essential to capture the behavior of a visitor when visiting the e-shop. A visit has an entry page and an exit page. It also has a referrer page, which is the way the visitor has accessed the site, e.g. search engine, social network, web advertisement, etc. If the referrer page is a search engine, the keywords used to find the site are also associated with the visit. A visit has a given duration. The attributes of a visit are the times when the entry page and the end page were accessed, duration, back link, whether or not an order was placed, the total goals converted during the visit, number of actions, number of events and number of searches. During a visit, transactions are made. A visit follows a path which has a next page, a previous page and a number. The class Navigation_step is used to model the path that the user follows from the entry page to the exit page. Each navigation step has only one
26 M.d.M.R. García et al. / Expert Systems With Applications 63 (2016) 20–34

Table 7
Page group: object and data properties.

Object properties  Description logic
hasNumberOfVisits  ∃ hasNumberOfVisits.Thing ⊑ E-shop ⊔ Page ⊔ Visitor
  ⊤ ⊑ ∀ hasNumberOfVisits.Number_of_visits
hasNumberOfVisitors  ∃ hasNumberOfVisitors.Thing ⊑ E-shop ⊔ Page
  ⊤ ⊑ ∀ hasNumberOfVisitors.Number_of_visitors
hasTotalRevenue  ∃ hasTotalRevenue.Thing ⊑ E-shop ⊔ Page
  ⊤ ⊑ ∀ hasTotalRevenue.Total_revenue
isOnPage  ∃ isOnPage.Thing ⊑ Item
  ⊤ ⊑ ∀ isOnPage.Page

Data properties  Description logic
hasName  ∃ hasName.Datatype Literal ⊑ Browser ⊔ Competitor ⊔ E-shop ⊔ Goal ⊔ Item ⊔ Operating_system ⊔ Page ⊔ Product
  ⊤ ⊑ ∀ hasName.Datatype string
hasURL  ∃ hasURL.Datatype Literal ⊑ Competitor ⊔ E-shop ⊔ Page ⊔ Price
  ⊤ ⊑ ∀ hasURL.Datatype string
hasTitle  ∃ hasTitle.Datatype Literal ⊑ Page
  ⊤ ⊑ ∀ hasTitle.Datatype string

Table 8
Item group: object and data properties.

Object properties  Description logic
hasItem  ∃ hasItem.Thing ⊑ Page
  ⊤ ⊑ ∀ hasItem.Item
hasPrice  ∃ hasPrice.Thing ⊑ Item ⊔ ShareProductData
  ⊤ ⊑ ∀ hasPrice.Price
includes  ∃ includes.Thing ⊑ Order
  ⊤ ⊑ ∀ includes.Item
isOnPage  ∃ isOnPage.Thing ⊑ Item
  ⊤ ⊑ ∀ isOnPage.Page

Data properties  Description logic
hasCategory  ∃ hasCategory.Datatype Literal ⊑ Item
  ⊤ ⊑ ∀ hasCategory.Datatype string
hasName  ∃ hasName.Datatype Literal ⊑ Browser ⊔ Competitor ⊔ E-shop ⊔ Goal ⊔ Item ⊔ Operating_system ⊔ Page ⊔ Product
  ⊤ ⊑ ∀ hasName.Datatype string
hasItemID  ∃ hasItemID.Datatype Literal ⊑ Item
  ⊤ ⊑ ∀ hasItemID.Datatype nonNegativeInteger
hasQuantity  ∃ hasQuantity.Datatype Literal ⊑ Item
  ⊤ ⊑ ∀ hasQuantity.Datatype nonNegativeInteger

attribute number. Tables 5 and 6 show the properties with classes in the visitor and visit group as domain, respectively.
- Page. Pages contain items, i.e. products and/or services to be sold. The analytic parameters for Page are: average_order_value, average_time_on_page, bounce_rate, date_of_last_visit, number_of_exits, number_of_landings, number_of_new_visitors, number_of_page_views, number_of_returning_visitors, number_of_sessions_by_medium (mediums are direct link, social media and search engine), number_of_sessions, number_of_unique_page_views, number_of_unique_visitors, number_of_units_sold, number_of_visitors, number_of_visits, revenue_per_session, and total_revenue. Attributes of Page are title and URL. A series of representative properties whose domain is Page are shown in Table 7. Interestingly, we can observe in this table that the property hasTotalRevenue is related to the Page, as well as to the whole e-shop, as this value can be calculated for both classes.
- Item. As commented before, an Item is a product or a service which is sold in an e-shop. Specific items of an e-shop are modeled by defining a domain ontology for a specific domain, e.g., travel, books, music, etc. Table 8 contains some representative object and data properties of class Item. According to this, Items have a price (hasPrice). Prices are valid on a certain date. Therefore, attributes for prices are value, currency and the date for the price validity. The attributes of Items are category and whether or not the item has been deleted. Products have a manufacturer. The attributes for products are: name, type, availability on a specific date and article number. The article number can be "ASIN", "EAN" or "ISBN".

4.2. Data sources: mapping and querying

As we explained in Section 3, we have focused on three main sources of data coming from different web tracking methods, namely: Google Analytics, Piwik, and specific web scraping methods in the scope of the SME E-Compass project.
The process of translating the collected data from different sources to RDF is carried out by means of mapping functions. Each data source has a different set of methods to gather, harmonize, store and provide access to the analytical data. Therefore, a different set of mapping functions is required to parse the information coming from each data source to RDF, according to the ontology. Fig. 3 illustrates a general overview of the mapping process to store data from different sources in a common RDF repository. Each set of mappings is then composed of functions that translate the attributes with their values into their corresponding triple form in RDF. In fact, for most of the attributes, a corresponding mapping function has been developed, involving its corresponding class in the ontology. Nevertheless, as a number of analytic attributes share a common structure in the ontology, they have been mapped by using generic functions, hence taking advantage of the ontology's design.
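As an illustration of the generic mapping functions mentioned above, the following minimal sketch translates one time-dependent analytic parameter into RDF triples in N-Triples syntax. It is not the authors' Java implementation: the namespace IRI, the function names, and the hasValue and hasDate properties are assumptions introduced only for this example, while hasNumberOfVisits and Number_of_visits are taken from the tables above.

```python
from datetime import date

# Hypothetical namespace: the paper does not state the ontology's IRI.
BASE = "http://example.org/web-analytics#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
XSD = "http://www.w3.org/2001/XMLSchema#"


def iri(name):
    """Wrap a local name into an IRI of the (assumed) ontology namespace."""
    return f"<{BASE}{name}>"


def literal(value):
    """Render a Python value as a typed N-Triples literal."""
    if isinstance(value, bool):  # must be tested before int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value}"^^<{XSD}double>'
    if isinstance(value, date):
        return f'"{value.isoformat()}"^^<{XSD}date>'
    # Plain string: only backslashes and double quotes are escaped here.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def map_analytic_parameter(subject, obj_property, param_class, value, day):
    """Generic mapping: one analytic parameter becomes one class instance.

    The parameter is modeled as an individual of `param_class`, linked to
    the subject (e-shop, page, ...) by `obj_property`. The properties
    hasValue and hasDate are assumptions of this sketch.
    """
    node = iri(f"{subject}_{param_class}_{day.isoformat()}")
    s = iri(subject)
    return [
        f"{s} {iri(obj_property)} {node} .",
        f"{node} {RDF_TYPE} {iri(param_class)} .",
        f"{node} {iri('hasValue')} {literal(value)} .",
        f"{node} {iri('hasDate')} {literal(day)} .",
    ]


triples = map_analytic_parameter(
    "eshop42", "hasNumberOfVisits", "Number_of_visits", 211, date(2015, 10, 23)
)
for t in triples:
    print(t)
```

Because every analytic parameter shares this shape (an instance of a parameter class with a value and a date), a single function of this kind can serve many attributes, which is the advantage of the ontology's design noted in the text.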

Fig. 3. General overview of the mapping process injecting data from different sources into the RDF repository.

4.2.1. Google Analytics
Google Analytics⁵ is a partially free web analytics service that provides statistics and basic analytical tools for Search Engine Optimization (SEO) and marketing purposes. The service is available to anyone with a Google account, although advanced e-commerce functionalities are only available for restricted users. Google Analytics is geared toward small and medium-sized retail websites.
The web tracking procedure in Google Analytics is performed by a "snippet" or digital footprint component, which provides the developer with an API of functions for accessing each attribute value. This digital footprint is a small piece of JavaScript code that is pasted into the e-shop's HTML source code and deployed in the web server where the e-shop is hosted. It activates Google Analytics tracking by inserting the JavaScript ga.js/analytics.js into the page. As illustrated in Fig. 3, the JavaScript component is then instantiated by our mapping functions by means of a series of Java classes to generate RDF triples.
Table A.1 in Appendix A contains the set of Google Analytics attributes that are currently tracked by our semantic approach. In this table, each attribute is listed with regard to its corresponding ontology class, data type, and description. This is a representative subset of the whole set of possible attributes (and their combinations) in the Google Analytics API specification, which covers all our preliminary requirements for the analysis of visitor behavior and products. However, it is worth mentioning that the proposed ontology can be easily extended to consider any of the attributes provided by Google Analytics.

⁵ http://www.google.com/analytics/

4.2.2. Piwik
Piwik⁶ is a free and open source web analytics application running on a PHP/MySQL web server. Piwik tracks online visits to one or more websites and displays reports on these visits for analysis. Piwik analytic features include, among others: real-time data updates, free advanced e-commerce analytics features, goal conversion tracking, event tracking, geolocation, page transition views and page overlay.
Similarly to Google Analytics, the web tracking procedure in Piwik is also performed by a digital footprint script that is placed in the e-shop's HTML source code. In the case of Piwik, the analytical data is automatically stored in a relational (SQL) database. Since we have access to this relational database, we have developed the mapping functions to directly query the analytic attributes. These attributes are described in Tables A.2–A.4 of Appendix A with regard to their corresponding ontology classes. The obtained data is then translated to RDF according to the ontology by means of specific mapping methods, as shown in Fig. 3.

⁶ http://piwik.org/

4.2.3. Web scraping methods
In the scope of the SME E-Compass project, there exists a series of methods for scraping product and price data from the competitors' e-shop websites. This way, a given e-shop's owner is able to automatically compare their products' prices with those of their direct competitors.
This specific functionality provides a REST API service from which we can obtain the attributes of a competitor's profile in JSON⁷ format, which is a compact and easily readable data format for the purpose of data exchange. Table A.5 contains the competitors' attributes that are mapped to RDF in our semantic model (see Fig. 3), with regard to the corresponding ontology classes.

⁷ http://json.org

4.2.4. RDF repository
Finally, an RDF repository is used to integrate the analytic data collected and mapped from the different sources. Therefore, by means of a SPARQL endpoint, it is now possible to query the analytic data unambiguously and independently of the source.
As an instance of data access, let us consider a scenario in which the analytic module requires information concerning the
visits of a given e-shop, on a certain date or period of time. The required information about visits should consist of both disaggregated data attributes, such as the ones provided by Piwik, and calculated metrics, such as those obtained from Google Analytics.

Fig. 4. Example of SPARQL query that returns disaggregated data attributes, as the ones provided by Piwik, as well as calculated metrics, as those obtained from Google Analytics.

The SPARQL query represented in Fig. 4 unifies the encoding of such logic, for which a couple of result samples are displayed in Table 9. Specifically, these results correspond to two consecutive visits to the e-shop with ID <eshop-id>, which were performed on 2015-10-23. The visit IDs are 75688 and 75692, and they were captured at timestamps 14:19:44 and 14:21:41, respectively. As shown in this table, the visit with a prolonged duration led to one goal conversion (usually a successful sale), whereas the visit with a short duration finished without any conversion, which represents a visitor that leaves the site prematurely.

Table 9
Two samples of the query result (Fig. 4) of a certain time slot (day 2015-10-23) of a real e-shop.

Attribute/metric              Visit75688  Visit75692
timestamp                     14:19:44    14:21:41
visit_total_searches          0           0
visit_total_events            0           0
visit_total_duration          2071        12
visit_total_goal_converted    1           0
total_bounce_rate             52.6066
total_conversion_rate         34.1232
total_number_of_entries       211
total_number_of_new_visitors  145

Aggregate attributes are calculated over all the visits in the time period of the SPARQL query. Therefore, as shown in the second half of Table 9, the e-shop registered a bounce rate close to 53% with a conversion rate⁸ of 34.12%, corresponding to all visits, bounces and purchases in the query's time period.
Another important attribute is the number of new visitors, which for this e-shop and this date is 145, i.e., 68.72% of total entries. This information could now be used to feed predictive algorithms that help the e-merchant to adopt a given marketing strategy to attract clients.
In order to automate and simplify access to the stored data, our semantic approach includes a specific REST API service with methods that implement predefined SPARQL queries. These methods are used as input of the data mining algorithms, as described in the following validation step.
As an additional advantage of this semantic approach, it is possible to connect our RDF repository with one or more external open linked data repositories. In this regard, a minimum adaptation has to be done in terms of deciding which classes with similar semantic meaning are directly linked between the two repositories. In fact, this is one of the most powerful features when using the semantic structure induced by the ontology.

⁸ Conversion rate: proportion of visitors converted into paying customers.
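The aggregate values in Table 9 can be checked with a short worked example. The sketch below assumes that each rate is a simple percentage of the 211 entries. The counts of 111 bounces and 72 conversions are not reported in the text: they are inferred here because they reproduce the published rates, and should be read as assumptions.

```python
def percentage(part, total):
    """Return part/total as a percentage; total must be positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


# Values reported in Table 9.
total_entries = 211
new_visitors = 145

# Assumed counts (not given in the source), chosen because they
# reproduce the reported bounce and conversion rates.
bounces = 111
conversions = 72

new_visitor_share = round(percentage(new_visitors, total_entries), 2)
bounce_rate = round(percentage(bounces, total_entries), 4)
conversion_rate = round(percentage(conversions, total_entries), 4)

print(new_visitor_share)  # 68.72
print(bounce_rate)        # 52.6066
print(conversion_rate)    # 34.1232
```

The first result matches the 68.72% share of new visitors quoted in the text, and the other two match the total_bounce_rate and total_conversion_rate rows of Table 9.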

Fig. 5. Percentage of visitors classified by typologies (misplaced, loyal and wandering) for country of origin. The percentages are relative to each country.

5. Validation

For the validation of our semantic approach, we have captured and integrated data from digital footprints (Google Analytics and Piwik) placed in more than 15 e-shops of different commercial sectors (retail, tourism, electronics, pharmacy, etc.) and countries (UK, Spain, Greece and Germany), throughout several months of activity. The resulting approach provides the data mining algorithms with a large and complementary set of web attributes and metrics that enable them to perform advanced analyses.
In particular, we focus in this section on two different case studies, for which we perform analyses regarding the visitor behavior and the product profile. Both kinds of analyses entail a series of unsupervised learning procedures to classify visitors and products into different predefined types. First, a clustering algorithm is performed to determine groups of visitors/products. After this, a decision tree is generated with rules to assign samples in clusters to predefined types. Finally, a classification procedure is used to assign the new incoming data to each predefined type.

5.1. Case study I: visitor behavior

The behavioral patterns of visitors have been classified into 3 types (misplaced, loyal, wandering) according to the features that they have in common, such as length of visit, number of purchases, whether or not they are recurrent regarding their visits/shopping on the web, etc.

– Loyal: The loyal visitors visit more pages than the average visitor, navigate the site frequently and make purchases more often.
– Misplaced: This visitor stays in the site for a very short period of time and visits a small number of pages. They rarely make transactions.
– Wandering: The wandering visitor stays in the site for quite a long period of time and visits a number of pages close to that of the "loyal" visitor. However, (s)he is usually new and makes fewer purchases than the loyal visitor.

This kind of analysis is generated from a series of Google Analytics attributes, such as: users, entrances, exits, pageviews, uniquepageviews, sessionduration, newusers, sessions, and bounces; in combination with dimensions: region, source, networkdomain, browser, hour, date, and city. These attributes are modeled in our ontology model as specified in Section 4.1 and in Table A.1.
The following charts (Figs. 5–7) show analytical visualizations of the visitor profiler provided by the SME E-Compass application for a given (Spanish) e-shop. In these figures, the distribution of predefined types is plotted depending on variables such as the origin (continent, country, region, city, etc.) or time (in a range of dates).
Specifically, Fig. 5 shows the percentage of loyal, misplaced and wandering visitors to the analyzed e-shop, in a time period (from 2015-08-15 to 2015-09-15), for each country of origin. We can observe in this figure that 100% of visitors from Israel, Madagascar, Japan, Switzerland, Mexico, and Peru are misplaced, whereas in the case of China and Argentina, all visitors are loyal. For other countries like Denmark, Dominican Republic, France, Germany, Netherlands, Sweden, United Kingdom, and Spain, the proportions of misplaced and loyal visitors are balanced. Wandering visitors are detected in the cases of France, Spain, and Ireland.
In more depth, if we focus on the global distribution of percentages per region for a given typology of visitors, as shown in Fig. 6, it is clearly observable that the highest percentage of misplaced visitors is from the Spanish region of Valencia. Although some regions of countries other than Spain registered low percentages of misplaced visitors, e.g., England (UK), State of Sao Paulo (Brazil), Ile-de-France (France), it is worth noting that the remaining regions of other countries registered global percentages lower than 1%, and therefore they are not shown in the sector chart of Fig. 6. Therefore, focusing on this classification and origin study, a given e-merchant might be interested in looking at the impact generated by a marketing campaign on those regions with the aim of converting misplaced visitors to loyal ones.

Fig. 6. Sector chart focused on the "misplaced" visitor typology and region.

In terms of time evolution, we can also focus on a specific date range and check whether or not a customer loyalty strategy is able to obtain improvements in the overall sales over time. In this regard, Fig. 7 plots evolution lines with respect to the three types of classified visitors. In this figure, it is easily observable that misplaced visitors evolve generally close to loyal ones, although we can inspect, by selecting a specific timeframe (e.g. from 2015-08-15 to 2015-09-15), on which days the number of loyal visitors is higher than the number of misplaced visitors, in order to find any insight into why this happened, e.g., whether or not it is the effect of a marketing campaign launched during this time period.

Fig. 7. Visitor typologies over time. The plot below shows the activity in a range of three months from 2015-06-30 to 2015-10-01, whereas the plot above is a specific timeframe selected for one month from 2015-08-15 to 2015-09-15.
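To make the three visitor typologies of this case study concrete, the following sketch encodes their textual definitions as hand-written rules. It is only an illustration: in the approach described above the rules come from a decision tree generated after clustering, whereas here the field names, the thresholds and the fall-through to "wandering" are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class VisitorProfile:
    """Per-visitor features; names are illustrative, not the ontology's."""
    avg_visit_seconds: float
    pages_per_visit: float
    visits: int
    purchases: int


def classify_visitor(profile, avg_pages=4.0, short_visit_seconds=30.0,
                     frequent_visits=3):
    """Assign one of the three typologies using hypothetical thresholds.

    - misplaced: very short stay and few pages visited
    - loyal: more pages than average, frequent visits, at least one purchase
    - wandering: everything else (typically long stays with few purchases)
    """
    if (profile.avg_visit_seconds < short_visit_seconds
            and profile.pages_per_visit < avg_pages):
        return "misplaced"
    if (profile.pages_per_visit > avg_pages
            and profile.visits >= frequent_visits
            and profile.purchases > 0):
        return "loyal"
    return "wandering"


samples = [
    VisitorProfile(avg_visit_seconds=12, pages_per_visit=1, visits=1, purchases=0),
    VisitorProfile(avg_visit_seconds=600, pages_per_visit=8, visits=5, purchases=2),
    VisitorProfile(avg_visit_seconds=500, pages_per_visit=7, visits=1, purchases=0),
]
labels = [classify_visitor(s) for s in samples]
print(labels)  # ['misplaced', 'loyal', 'wandering']
```

In practice the thresholds would be learned from the clustered data rather than fixed by hand, so this sketch shows only the shape of the resulting rules.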

5.2. Case study II: product profiler

The intrinsic idea in product analysis is to show evidences


about the relationships between visitors actions and purchased
products, in order to guide the e-merchant to improve its benefits
by updating the products visibility and their prices.
This kind of analysis requires information concerning, not only
the navigation activity of visitors, but also the e-commerce or pur-
chasing habits. Therefore, we now focus on captured data from
Piwik attributes, such as: idaction_sku (Stock-keeping unit), idac-
tion_name, idaction_category, location_geoip, visit_first_action_time,
visitor_days_since_order, visit_goal_buyer, and visit_goal_converted.
These attributes are modeled in our ontology model as specified
in Section 4.1 and in Tables A.2–A.4.
In this regard, Fig. 8 shows the relationships between product conversions and visitors. It consists of a SWOT (strengths, weaknesses, opportunities and threats) diagram in which the horizontal axis provides an indication of the number of visitors for a product within the e-shop, whereas the vertical axis shows the conversion rate of a product. The size of points represents the total revenue generated by a particular product. Therefore, for each product in a given e-shop, a different strategy can be assessed according to which of the 4 quadrants of the plot it is located in. This way, products with low visitors and low conversions show little interest from visitors, as well as few sales. Profitable products are located in the Low Visitors-High Conversions quadrant, having little web traffic oriented to them. With proper promotion, such a product could be visited more and create even more sales. High Visitors-Low Conversions products might be interesting, but somehow visitors are not totally convinced to buy them. These products are candidates for revising web positioning, features, price, etc. Finally, High Visitors-High Conversions products are attractive and profitable products that could be successful by themselves.

Fig. 8. Product SWOT analysis.

In general, it might be desirable to check whether products are behaving as expected, regarding their position in the chart. For this specific case study, we are using data from footprints placed in an e-shop of cosmetic and hairdressing products, whose products in the High Visitors-High Conversions quadrant are: Wella SP Luxe Shampoo 1000 ml and Wella Professional Repair 400 ml. These customer preferences correspond to the text cloud tool shown in Fig. 9, where the e-shop's owner can rapidly view the trending product characteristics.

Fig. 9. Trending products.

Finally, it is worth saying that without integrating data from many sources, such results could not be obtained, since they are computed by using data queried for specific visitor features with complementary attributes, as shown in Section 4.2.4. In particular, the case studies presented here use aggregated and disaggregated attributes that are provided by Google Analytics and Piwik, respectively.

6. Conclusions

In this work, we propose a semantic approach that uses an ontology as a mediated schema for the representation and consolidation of the tracking data from web source's semantics. Semantic mappings between the source schema and the ontology are then defined and used to transform the original data to RDF. In this way, data from heterogeneous sources are stored and integrated inside a single RDF repository, which can now be easily queried by high-level algorithms.
The main hypothesis guiding us is that: (H1) an ontology-based integration approach will help us to collect, fuse the data together in a coherent way, and store web analytics data, from many sources of popular and commercial digital footprints. In order to test it, we have designed and implemented for the first time an OWL (Web Ontology Language) ontology for web analytics. This ontology considers a large and complementary set of attributes and metrics, which are taken from several representative web analytics tools in the market, as well as other specific attributes of e-commerce analysis, such as competitors' behavior and price monitoring.
In this regard, we have conducted a series of analyses to validate our semantic approach. We have captured and integrated data from Google Analytics and Piwik digital footprints placed in 15 e-shops of different commercial sectors (retail, tourism, electronics, pharmacy, etc.) and countries (UK, Spain, Greece and Germany), throughout several months of activity. The data are integrated following the same (standard) format and stored in a common RDF repository.
As a secondary hypothesis, we expect (H2) to obtain enriched and semantically annotated data able to successfully train data mining procedures for advanced analysis of customer behavior in real e-commerce sites. The obtained "semantized" data are used to train advanced data mining algorithms to perform customer profile analyses. In particular, these algorithms are tested with success in two case studies to classify visitor behavior and product preference. The resulting approach provides data mining algorithms with a large, complementary, and well-grounded set of web attributes and metrics that enable them to perform advanced analyses.
As future work, we plan to include new kinds of analytic footprints in our semantic approach. This will probably entail new updates of our ontology model to consider further attributes for advanced web analytics. We are also interested in incorporating open linked data to enrich our semantic model with new perspectives of information, such as meteorological information, product descriptions and/or social sector affinities.

Appendix A. Analytic metrics and attributes

The complete set of attributes and metrics used from Google Analytics, Piwik and the scraping methods is described in Tables A.1–A.5. The corresponding ontology class of each attribute is given in the first column of these tables.

Table A.1
Google Analytics used metrics and attributes in the ontology model.

Ontology class Attribute Type Description

E-shop transactions/sessions float∗ Number of transactions divided by number of visits


Page sessions int∗ Number of sessions
Page entrances int∗ Number of sessions starting at this page
Page exits int∗ Number of sessions ending at this page
Page pageviews int∗ Number of page views
Page uniquepageviews int∗ Number of unique page views
Page bouncerate float∗ Number of sessions with just one page view
Visit sessionduration int∗ Session duration in seconds
E-shop avgsessionduration float∗ Average session duration in seconds
E-shop percentNewSessions float∗ Percentage of new sessions
E-shop pageviewspersession float∗ Average number of pages visited per session
E-shop, Page users int∗ Number of unique users/visitors
E-shop, Page newusers int∗ Number of new visitors
E-shop, Page users − newusers int∗ Number of returning visitors
E-shop transactions int Number of e-commerce transactions
E-shop, Page itemRevenue float∗ Total e-commerce revenue
E-shop, Page itemQuantity int∗ Total number of units sold
E-shop, Page transactionRevenuePerSession float∗ E-commerce revenue per Session
E-shop, Page revenuePerTransaction float∗ Average order value
E-shop, Page bounces int∗ Total number of single page (or single engagement hit) sessions
Visit uniquePurchases int∗ Number of product sets purchased

All these metrics are combined with dimensions: date, hour, city, region, browser, networkDomain, and source.
Besides, sessions are combined with dimensions: city, region, country and continent.

Table A.2
Piwik used metrics and attributes in the ontology model.

Ontology class Attribute Type Description

Address location_geoip_latitude decimal (7,4) Latitude where the visitor lives


location_geoip_longitude decimal (7,4) Longitude where the visitor lives
location_geoip_city varchar (100) City where the visitor lives
location_geoip_region varchar (2) Region where the visitor lives
Browser config_browser_name varchar(10) Browsers name acronym
config_browser_version varchar (20) Browsers version
City location_geoip_city varchar (100) City where the visitor lives
Continent location_geoip_continent varchar (100) Continent where the visitor lives
Country location_geoip_country varchar (100) Country where the visitor lives
Device config_resolution varchar (9) Devices resolution
config_device_type tinyint(100) Kind of device
config_device_brand varchar(100) The brand of the device
config_device_model varchar(100) The model of the device
IP address location_ip varbinary (16) Connection IP address
ISP provider location_provider varchar(100) ISP provider
Keyword referrer_keyword varchar(255) set of words that has hit the web via search engine
Navigation step idaction_url_ref int(10) unsigned The previous action URL
Idaction_name_ref varchar (1000) The previous action name
time_spent_ref_action int(10) unsigned The duration of the action
Operating system config_os char (3) Operating systems name acronym
config_os_version varchar(100) Operating systems version
Page idaction_name varchar (100) Pages URL name
server_time datetime Time when the action has happened
Path Idaction_url varchar (1000) Page’s URL
Product idaction_sku int (10) Product ID
idaction_name varchar (100) Page’s URL name
idaction_category varchar (50) Product’s category
price float Product’s price
quantity int(10) Amount of products
deleted tinyint(1) if deleted of the active part of the catalogue
Region location_geoip_region varchar (2) Region where the visitor lives

Table A.3
Piwik used metrics and attributes in the ontology model.

Ontology class Attribute Type Description

Visit idVisit int (10) unsigned Visit ID


visit_first_action_time datetime Time when the first action of the visit happens
visit_last_action_time datetime Time when the last action of the visit happens
visit_exit_idaction_url varchar (1000) The last action’s URL of the visit
visit_exit_idaction_name varchar (255) The last action’s name of the visit
visit_entry_idaction_url varchar (1000) The first action’s URL of the visit
visit_entry_idaction_name varchar (255) The first action’s name of the visit
referer_name varchar (70) Website’s name referring to the landing page of the site
referer_url text Website’s URL referring to the landing page of the site
referer_type tinyint(1) unsigned The kind of referrer link (search engine, social net., etc.)
visit_total_actions smallint(5) unsigned Number of visit’s actions
visit_total_searches smallint(5) unsigned Number of visit’s searches
visit_total_events smallint(5) unsigned Number of visit’s events
visit_total_time smallint(5) unsigned Total time of the visit
visit_goal_converted tinyint(1) Whether or not this visit converted a goal
visit_goal_buyer tinyint(1) If the visitor ordered something during this visit


Visitor idVisitor binary(8) Visitor ID
visitor_localtime time The time of the machine that the visitor use
visitor_returning tinyint(1) Whether or not the visitor is recurrent
visitor_count_visits smallint(5) unsigned Number of visits carried out by the visitor
visitor_days_since_last smallint(5) unsigned Number of days since the last visit of this visitor
visitor_days_since_order smallint(5) unsigned Number of days since the order of this visitor
visitor_days_since_first smallint(5) unsigned Number of days since the first visit of this visitor
Site idSite int(10) unsigned Site ID
name varchar (90) Site’s name
main_url varchar (255) The main URL of the site
timezone varchar(50) The site’s time zone (UTC)
currency char(3) The currency of the site
Idiom location_browser_lang varchar(20) Browser’s language

Table A.4
Piwik used metrics and attributes in the ontology model.

Ontology class Attribute Type Description

Goal idGoal int(10) unsigned Goal ID
name varchar(50) Goal’s name
match_attribute varchar(20) Related attribute with the goal
pattern varchar(255) How the goal can be converted
pattern_type varchar(10) The kind of the pattern
revenue float The revenue per visit for each goal
deleted tinyint(4) Whether or not it has been deleted from the goals collection
allow_multiple tinyint(4) Allow goal to be triggered more than once per visit

Order idOrder varchar(100) Order’s ID
idGoal int(10) The ID of the goal this conversion is for
revenue float The amount of revenue a conversion generates (if any)
revenue_subtotal float Total cost of the items in the order/cart
items smallint unsigned Number of items in the order/cart
revenue_tax float Total tax applied to the items in the order/cart
revenue_shipping float Total cost of shipping
revenue_discount float Total discount applied to the order


If this conversion is for an e-commerce order or abandoned cart.

Table A.5
Attributes of the web scraping methods in the ontology model.

Ontology class Attribute Type Description

E-shop owner E-shop ID Integer ID of E-shop owner given by E-COMPASS Cockpit (user management)
Last name String Last name of the person in charge (employee of the e-shop)
First name String First name of the person in charge (employee of the e-shop)
E-Mail address String E-Mail address of the person in charge (employee of the e-shop)
E-shop URL String Start page of the e-shop
E-shop owner ID Integer ID of E-shop owner given by E-COMPASS Cockpit (user management)
Competitor ID E-shopID E-shop ID of all competitors
Product Product ID Integer Product ID of the E-COMPASS System
Name String Product Name given by E-Shop owner (e.g. as a search query)
Article Number Type String Type of article number (ASIN, EAN and/or ISBN)∗
Value String The value of product (ASIN, EAN and/or ISBN)∗
Price Value Double Price value on scraping date
Currency String Currency of Price
Date Date Scraping date of product price
Availability Value String Availability of the product: “available” or “not available”
Date Date Scraping date of availability


∗ ASIN: Amazon Standard Identification Number, a ten-character alphanumeric product code;
EAN: European Article Number, 8-digit or 13-digit number for product identification;
ISBN: International Standard Book Number, 10-digit or 13-digit number for book identification.
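
The article-number formats in the footnote above can be checked with a small shape validator. The sketch below is illustrative only: the function name `has_valid_shape` and the sample values are hypothetical, the check covers length and character class but not check digits, and it assumes that hyphens and spaces may be stripped before matching. It also accepts a trailing `X` in 10-character ISBNs, which is how ISBN-10 encodes a check value of ten.

```python
import re

# Shape checks derived from the footnote: length and character class only.
# Check digits are NOT verified here.
PATTERNS = {
    "ASIN": re.compile(r"[A-Za-z0-9]{10}"),
    "EAN": re.compile(r"\d{8}|\d{13}"),
    # ISBN-10 may end in 'X' (check value ten); ISBN-13 is all digits.
    "ISBN": re.compile(r"\d{9}[\dXx]|\d{13}"),
}


def has_valid_shape(number_type: str, value: str) -> bool:
    """Return True if value has the expected shape for the given type."""
    pattern = PATTERNS.get(number_type.upper())
    if pattern is None:
        raise ValueError(f"unknown article number type: {number_type!r}")
    # Assumption: separators are cosmetic and can be dropped before matching.
    normalized = value.replace("-", "").replace(" ", "")
    return pattern.fullmatch(normalized) is not None


# Made-up sample values, for illustration only.
checks = [
    has_valid_shape("ASIN", "B000123456"),
    has_valid_shape("EAN", "4006381333931"),
    has_valid_shape("ISBN", "0-306-40615-2"),
    has_valid_shape("EAN", "12345"),
]
```

A check of this kind could run before a scraped Article Number value is stored, so that malformed identifiers are rejected early.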

References Horrocks, I., & Patel-Schneider, P. (2003). Reducing owl entailment to description
logic satisfiability. In The semantic web - iswc 2003. In Lecture Notes in Computer
Akanbi, A. K. (2014). Lb2co: a semantic ontology framework for b2c ecommerce Science: 2870 (pp. 17–29). Springer Berlin.
transaction on the internet. International Journal of Research in Computer Science, McGuinness, D., & Harmelen, F. (2004). OWL web ontology language overview. Tech-
4(1), 1–9. nical Report. W3C Recommendation.
Dean, M., & Schreiber, G. (2004). OWL web ontology language reference. Technical Natalya, N., McGuinness, F., & Deborah, L. (2001). DOntology Development 101:
Report. W3C Recommendation, 10 February 2004. A Guide to Creating Your First Ontology. Technical Report. tanford University
Garía-Nieto, J., & Roldán, M. (2014). D2.1 SME-E-COMPASS requirements analysis. Knowledge Systems Laboratory Technical Report KSL-01-05.
Technical Report. Public Deliverable. Pérez, J., Arenas, M., & Gutierrez, C. (2009). Semantics and complexity of sparql.
Gatchalee, P., Li, Z., & Supnithi, T. (2013). Ontology development for smes e-com- ACM Transactions on Database Systems, 34(3), 1–45.
merce website based on content analysis and its recommendation system. In Staab, S., & Studer, R. (2009). Handbook on Ontologies. International Handbooks on
Computer science and engineering conference (icsec), 2013 international (pp. 7–12). Information Systems. Springer.
Gruber, T. R. (1993). A translation approach to portable ontologies. Knowledge Acqui- Tamma, V., Phelps, S., Dickinson, I., & Wooldridge, M. (2005). Ontologies for sup-
sition,, 5(2), 199–220. porting negotiation in e-commerce. Engineering Applications of Artificial Intelli-
Haase, P., & Stojanovic, L. (2005). Consistent evolution of owl ontologies. In gence, 18(2), 223–236.
A. Gmez-Prez, & J. Euzenat (Eds.), The semantic web: research and applications. Trastour, D., Bartolini, C., & Preist, C. (2003). Semantic web support for the business–
In Lecture Notes in Computer Science: 3532 (pp. 182–197). Springer Berlin Hei- to-business e-commerce pre-contractual lifecycle. Computer Networks, 42(5),
delberg. 661–673.
Hepp, M. (2008). Goodrelations: an ontology for describing products and ser- Waralakv, S. (2008). Learning semantic web from e-tourism. In N. Nguyen, G. Jo,
vices offers on the web. In Proceedings of the 16th international conference on R. Howlett, & L. Jain (Eds.), Agent and multi-agent systems: Technologies and ap-
knowledge engineering and knowledge management (ekaw2008) (pp. 332–347). plications. In Lecture Notes in Computer Science: 4953 (pp. 516–525). Springer
Springer LNCS, Vol 5268. Berlin Heidelberg.
