Big Data Computing

Edited by
Rajendra Akerkar
Western Norway Research Institute
Sogndal, Norway
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2014 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20131028

International Standard Book Number-13: 978-1-4665-7838-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com

To

All the visionary minds who have helped create a modern data science profession
Contents

Preface ... ix
Editor ... xvii
Contributors ... xix

Section I: Introduction

1. Toward Evolving Knowledge Ecosystems for Big Data Understanding ... 3
Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez

2. Tassonomy and Review of Big Data Solutions Navigation ... 57
Pierfrancesco Bellini, Mariano di Claudio, Paolo Nesi, and Nadia Rauch

3. Big Data: Challenges and Opportunities ... 103
Roberto V. Zicari

Section II: Semantic Technologies and Big Data

4. Management of Big Semantic Data ... 131
Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez

5. Linked Data in Enterprise Integration ... 169
Sören Auer, Axel-Cyrille Ngonga Ngomo, Philipp Frischmuth, and Jakub Klimek

6. Scalable End-User Access to Big Data ... 205
Martin Giese, Diego Calvanese, Peter Haase, Ian Horrocks, Yannis Ioannidis, Herald Kllapi, Manolis Koubarakis, Maurizio Lenzerini, Ralf Möller, Mariano Rodriguez Muro, Özgür Özçep, Riccardo Rosati, Rudolf Schlatte, Michael Schmidt, Ahmet Soylu, and Arild Waaler

7. Semantic Data Interoperability: The Key Problem of Big Data ... 245
Hele-Mai Haav and Peep Küngas

Section III: Big Data Processing

8. Big Data Exploration ... 273
Stratos Idreos

9. Big Data Processing with MapReduce ... 295
Jordà Polo

10. Efficient Processing of Stream Data over Persistent Data ... 315
M. Asif Naeem, Gillian Dobbie, and Gerald Weber

Section IV: Big Data and Business

11. Economics of Big Data: A Value Perspective on State of the Art and Future Trends ... 343
Tassilo Pellegrin

12. Advanced Data Analytics for Business ... 373
Rajendra Akerkar

Section V: Big Data Applications

13. Big Social Data Analysis ... 401
Erik Cambria, Dheeraj Rajagopal, Daniel Olsher, and Dipankar Das

14. Real-Time Big Data Processing for Domain Experts: An Application to Smart Buildings ... 415
Dario Bonino, Fulvio Corno, and Luigi De Russis

15. Big Data Application: Analyzing Real-Time Electric Meter Data ... 449
Mikhail Simonov, Giuseppe Caragnano, Lorenzo Mossucca, Pietro Ruiu, and Olivier Terzo

16. Scaling of Geographic Space from the Perspective of City and Field Blocks and Using Volunteered Geographic Information ... 483
Bin Jiang and Xintao Liu

17. Big Textual Data Analytics and Knowledge Management ... 501
Marcus Spies and Monika Jungemann-Dorner

Index ... 539
Preface

In the international marketplace, businesses, suppliers, and customers create and consume vast amounts of information. Gartner* predicts that enterprise data in all forms will grow up to 650% over the next five years. According to IDC,† the world's volume of data doubles every 18 months. Digital information is doubling every 1.5 years and will exceed 1000 exabytes next year, according to the MIT Centre for Digital Research. In 2011, medical centers held almost 1 billion terabytes of data. That is almost 2000 billion file cabinets' worth of information. This deluge of data, often referred to as Big Data, obviously creates a challenge for the business community and data scientists.
The term Big Data refers to data sets whose size is beyond the capabilities of current database technology. It is an emerging field in which innovative technology offers alternatives for resolving the inherent problems that appear when working with massive data, offering new ways to reuse and extract value from information.
Businesses and government agencies aggregate data from numerous private and/or public data sources. Private data is information that an organization stores exclusively and that is available only to that organization, such as employee data, customer data, and machine data (e.g., user transactions and customer behavior). Public data is information that is available to the public for a fee or at no charge, such as credit ratings and social media content (e.g., LinkedIn, Facebook, and Twitter). Big Data has now reached every sector in the world economy. It is transforming competitive opportunities in every industry sector, including banking, healthcare, insurance, manufacturing, retail, wholesale, transportation, communications, construction, education, and utilities. It also plays key roles in trade operations such as marketing, operations, supply chain, and new business models. It is becoming rather evident that enterprises that fail to use their data efficiently are at a large competitive disadvantage compared with those that can analyze and act on their data. The possibilities of Big Data continue to evolve swiftly, driven by innovation in the underlying technologies, platforms, and analytical capabilities for handling data, as well as by the evolution of behavior among its users as humans increasingly live digital lives.
It is interesting to note that Big Data differs from conventional data models (e.g., relational databases and data models, or conventional governance models). Thus, it is triggering concern in organizations as they try to separate information nuggets from the data heap. The conventional models of structured, engineered data do not adequately reveal the realities of Big Data. The key to leveraging Big Data is to realize these differences before expediting its use. The most noteworthy difference is that data are typically governed in a centralized manner, whereas Big Data is self-governing. Big Data is created either by a rapidly expanding universe of machines or by users of highly varying expertise. As a result, the composition of traditional data will naturally vary considerably from that of Big Data. The composition of traditional data serves a specific purpose and must be more durable and structured, whereas Big Data will cover many topics, but not all topics will yield useful information for the business; they will thus be sparse in relevancy and structure.

* http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf
† http://www.idc.com/
The technology required for Big Data computing is developing at a satisfactory rate due to market forces and technological evolution. The ever-growing, enormous amount of data, along with advanced tools of exploratory data analysis, data mining/machine learning, and data visualization, offers a whole new way of understanding the world.
Another interesting fact about Big Data is that not everything considered "Big Data" is in fact Big Data. One needs to explore deeply the scientific aspects, such as analyzing, processing, and storing huge volumes of data. That is the only way to use tools effectively. Data developers/scientists need to know about analytical processes, statistics, and machine learning. They also need to know how to use specific data to program algorithms. The core is the analytical side, but they also need the scientific background and in-depth technical knowledge of the tools they work with in order to gain control of huge volumes of data. There is no one tool that offers this per se.
As a result, the main challenge for Big Data computing is to find a novel solution, keeping in mind that data sizes are always growing. This solution should remain applicable for a long period of time, which means that the key condition a solution has to satisfy is scalability. Scalability is the ability of a system to accept increased input volume without impacting the profits; that is, the gains from an input increment should be proportional to the increment itself. For a system to be totally scalable, the size of its input should not be a design parameter. Pushing the system designer to consider all possible deployment sizes to cope with different input sizes leads to a scalable architecture without primary bottlenecks. Yet, apart from scalability, there are other requisites for a Big Data–intensive computing system.
Although Big Data is an emerging field in data science, very few books are available in the market. This book provides authoritative insights and highlights valuable lessons learned by experienced authors.
Some universities in North America and Europe are doing their part to feed the need for analytics skills in this era of Big Data. In recent years, they have introduced master of science degrees in Big Data analytics, data science, and business analytics. Some contributing authors have been involved in developing a course curriculum in their respective institutions and countries. The number of courses on "Big Data" will increase worldwide because it is becoming a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office.*
The main features of this book can be summarized as follows:

1. It describes the contemporary state of the art in the new field of Big Data computing.
2. It presents the latest developments, services, and main players in this explosive field.
3. Contributors to the book are prominent researchers from academia and practitioners from industry.

Organization
This book comprises five sections, each of which covers one aspect of Big Data
computing. Section I focuses on what Big Data is, why it is important, and
how it can be used. Section II focuses on semantic technologies and Big Data.
Section III focuses on Big Data processing—tools, technologies, and methods
essential to analyze Big Data efficiently. Section IV deals with business and
economic perspectives. Finally, Section V focuses on various stimulating Big
Data applications. Below is a brief outline with more details on what each
chapter is about.

Section I: Introduction
Chapter 1 provides an approach to address the problem of “understanding”
Big Data in an effective and efficient way. The idea is to make adequately
grained and expressive knowledge representations and fact collections that
evolve naturally, triggered by new tokens of relevant data coming along.
The chapter also presents primary considerations on assessing fitness in an
evolving knowledge ecosystem.
Chapter 2 then gives an overview of the main features that can characterize architectures for solving a Big Data problem, depending on the source of data, the type of processing required, and the application context in which it should be operated.

* http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_
data_The_next_frontier_for_innovation

Chapter 3 discusses Big Data from three different standpoints: the business, the technological, and the social. The chapter lists some relevant initiatives and selected thoughts on Big Data.

Section II: Semantic Technologies and Big Data


Chapter 4 presents the foundations of Big Semantic Data management. The chapter sketches a route from the current data deluge, through the concept of Big Data, to the need for machine-processable semantics on the Web. Further, it justifies the different management problems arising in Big Semantic Data by characterizing their main stakeholders by role and nature.
A number of challenges arising in the context of Linked Data in Enterprise Integration are covered in Chapter 5. A key prerequisite for addressing these challenges is the establishment of efficient and effective link discovery and data integration techniques that scale to the large data scenarios found in the enterprise. The chapter also illustrates the transformation step of Linked Data integration with two algorithms.
Chapter 6 proposes steps toward the solution of the data access problem that end users usually face when dealing with Big Data. The chapter discusses the state of the art in ontology-based data access (OBDA) and explains why OBDA is the superior approach to the data access challenge posed by Big Data. It also explains why the field of OBDA is not yet sufficiently mature to deal satisfactorily with these problems, and it finally presents thoughts on scaling OBDA to a level where it can be deployed effectively to Big Data.
Chapter 7 addresses large-scale semantic interoperability problems of
data in the domain of public sector administration and proposes practical
solutions to these problems by using semantic technologies in the context
of Web services and open data. This chapter also presents a case of the
Estonian semantic interoperability framework of state information systems
and related data interoperability solutions.

Section III: Big Data Processing


Chapter 8 presents a new way of query processing for Big Data in which data exploration becomes a first-class citizen. Data exploration is desirable when big new chunks of data arrive rapidly and one needs to react quickly. This chapter focuses on database systems technology, which for several years has been the prime data-processing tool.

Chapter 9 explores the MapReduce model, a programming model used to develop massively parallel applications that process and generate large amounts of data. The chapter also discusses how MapReduce is implemented in Hadoop and provides an overview of its architecture.
A particular class of stream-based joins, namely the join of a single stream with a traditional relational table, is discussed in Chapter 10. Two available stream-based join algorithms are investigated in this chapter.

Section IV: Big Data and Business


Chapter 11 examines the economic value of Big Data from both a macro- and a microeconomic perspective. The chapter illustrates how technology and new skills can nurture opportunities to derive benefits from large, constantly growing, dispersed data sets, and how semantic interoperability and new licensing strategies will contribute to the uptake of Big Data as a business enabler and a source of value creation.
Nowadays, businesses are enhancing their business intelligence practices to include predictive analytics and data mining, combining the best of strategic reporting and basic forecasting with advanced operational intelligence and decision-making functions. Chapter 12 discusses how Big Data technologies, advanced analytics, and business intelligence (BI) are interrelated. The chapter also presents various areas of advanced analytic technologies.

Section V: Big Data Applications


The final section of the book covers application topics, starting in Chapter 13 with novel concept-level approaches to opinion mining and sentiment analysis that allow a more efficient passage from (unstructured) textual information to (structured) machine-processable data, in potentially any domain.
Chapter 14 introduces the spChains framework, a modular approach to support the mastering of complex event processing (CEP) queries in an abridged, but effective, manner based on stream-processing block composition. The approach aims at unleashing the power of CEP systems for teams having limited insight into CEP systems.
Real-time electricity metering operated at subsecond data rates in a grid with 20 million nodes generates more than 5 petabytes daily, while the required decision-making timeframe in SCADA systems operating load shedding might be lower than 100 milliseconds. Chapter 15 discusses this real-life optimization task and the data management approach permitting a solution to the issue.
Chapter 16 presents an innovative outlook on the scaling of geographical space using large street networks involving both cities and countryside. Given the street network of an entire country, the chapter proposes to decompose the network into individual blocks, each of which forms a minimum ring or cycle, such as city blocks and field blocks. The chapter further elaborates the power of the block perspective in reflecting the patterns of geographical space.
Chapter 17 presents the influence of recent advances in natural language processing on business knowledge life cycles and processes of knowledge management. The chapter also sketches envisaged developments and market impacts related to the integration of semantic technology and knowledge management.

Intended Audience
The aim of this book is to be accessible to researchers, graduate students, and application-driven practitioners who work in data science and related fields. This edited book requires no previous exposure to large-scale data analysis or NoSQL tools; acquaintance with traditional databases is an added advantage.
This book provides the reader with a broad range of Big Data concepts, tools, and techniques. A wide range of research in Big Data is covered, and comparisons between state-of-the-art approaches are provided. The book can thus help researchers from related fields (such as databases, data science, data mining, machine learning, knowledge engineering, information retrieval, and information systems), as well as students who are interested in entering this field of research, to become familiar with recent research developments and to identify open research challenges in Big Data. It can also help practitioners better understand the current state of the art in Big Data techniques, concepts, and applications.
The technical level of this book also makes it accessible to students taking advanced undergraduate-level courses on Big Data or data science. Although such courses are currently rare, with the ongoing challenges that the areas of intelligent information/data management pose in many organizations in both the public and private sectors, there is worldwide demand for graduates with skills and expertise in these areas. It is hoped that this book helps address that demand.
In addition, the goal is to help policy-makers, developers and engineers, data scientists, as well as individuals, navigate the new Big Data landscape. I believe it can trigger some new ideas for practical Big Data applications.

Acknowledgments
The organization and the contents of this edited book have benefited from our outstanding contributors. I am very proud and happy that these researchers agreed to join this project and prepare a chapter for this book. I am also very pleased to see this materialize in the way I originally envisioned. I hope this book will be a source of inspiration to its readers. I especially wish to express my sincere gratitude to all the authors for their contributions to this project.
I thank the anonymous reviewers who provided valuable feedback and helpful suggestions.
I also thank Aastha Sharma, David Fausel, Rachel Holt, and the staff at CRC Press (Taylor & Francis Group), who supported this book project right from the start.
Last, but not least, a very big thanks to my colleagues at Western Norway Research Institute (Vestlandsforsking, Norway) for their constant encouragement and understanding.
I wish all readers a fruitful time reading this book, and hope that they experience the same excitement as I did—and still do—when dealing with data.

Rajendra Akerkar
Editor

Rajendra Akerkar is professor and senior researcher at Western Norway Research Institute (Vestlandsforsking), Norway, where his main domain of research is semantic technologies, with the aim of combining theoretical results with high-impact real-world solutions. He also holds visiting academic assignments in India and abroad. In 1997, he founded and chaired the Technomathematics Research Foundation (TMRF) in India.
His research and teaching experience spans over 23 years in academia, including different universities in Asia, Europe, and North America. His research interests include ontologies, semantic technologies, knowledge systems, large-scale data mining, and intelligent systems.
He received a DAAD fellowship in 1990 and is also a recipient of the prestigious BOYSCASTS Young Scientist award of the Department of Science and Technology, Government of India, in 1997. From 1998 to 2001, he was a UNESCO-TWAS associate member at the Hanoi Institute of Mathematics, Vietnam. He was also a DAAD visiting professor at Universität des Saarlandes and the University of Bonn, Germany, in 2000 and 2007, respectively.
Dr. Akerkar serves as editor-in-chief of the International Journal of Computer Science & Applications (IJCSA) and as an associate editor of the International Journal of Metadata, Semantics, and Ontologies (IJMSO). He is co-organizer of several workshops and program chair of the international conferences ISACA, ISAI, ICAAI, and WIMS. He has co-authored 13 books and approximately 100 research papers, co-edited 2 e-books, and edited 5 volumes of international conferences. He has also been actively involved in several international ICT initiatives and research & development projects for more than 16 years.

Contributors

Rajendra Akerkar, Western Norway Research Institute, Sogndal, Norway
Mario Arias, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Sören Auer, Enterprise Information Systems Department, Institute of Computer Science III, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
Pierfrancesco Bellini, Distributed Systems and Internet Technology, Department of Systems and Informatics, University of Florence, Firenze, Italy
Dario Bonino, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Diego Calvanese, Department of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
Erik Cambria, Department of Computer Science, National University of Singapore, Singapore
Giuseppe Caragnano, Advanced Computing and Electromagnetics Unit, Istituto Superiore Mario Boella, Torino, Italy
Michael Cochez, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Fulvio Corno, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Dipankar Das, Department of Computer Science, National University of Singapore, Singapore
Luigi De Russis, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Mariano di Claudio, Department of Systems and Informatics, University of Florence, Firenze, Italy
Gillian Dobbie, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Vadim Ermolayev, Zaporozhye National University, Zaporozhye, Ukraine
Javier D. Fernández, Department of Computer Science, University of Valladolid, Valladolid, Spain
Philipp Frischmuth, Department of Computer Science, University of Leipzig, Leipzig, Germany
Martin Giese, Department of Computer Science, University of Oslo, Oslo, Norway
Claudio Gutiérrez, Department of Computer Science, University of Chile, Santiago, Chile
Peter Haase, Fluid Operations AG, Walldorf, Germany
Hele-Mai Haav, Institute of Cybernetics, Tallinn University of Technology, Tallinn, Estonia
Ian Horrocks, Department of Computer Science, Oxford University, Oxford, United Kingdom
Stratos Idreos, Dutch National Research Center for Mathematics and Computer Science (CWI), Amsterdam, the Netherlands
Yannis Ioannidis, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Bin Jiang, Department of Technology and Built Environment, University of Gävle, Gävle, Sweden
Monika Jungemann-Dorner, Senior International Project Manager, Verband der Verein Creditreform eV, Neuss, Germany
Jakub Klimek, Department of Computer Science, University of Leipzig, Leipzig, Germany
Herald Kllapi, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Manolis Koubarakis, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Peep Küngas, Institute of Computer Science, University of Tartu, Tartu, Estonia
Maurizio Lenzerini, Department of Computer Science, Sapienza University of Rome, Rome, Italy
Xintao Liu, Department of Technology and Built Environment, University of Gävle, Gävle, Sweden
Miguel A. Martínez-Prieto, Department of Computer Science, University of Valladolid, Valladolid, Spain
Ralf Möller, Department of Computer Science, TU Hamburg-Harburg, Hamburg, Germany
Lorenzo Mossucca, Istituto Superiore Mario Boella, Torino, Italy
Mariano Rodriguez Muro, Department of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
M. Asif Naeem, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Paolo Nesi, Department of Systems and Informatics, Distributed Systems and Internet Technology, University of Florence, Firenze, Italy
Axel-Cyrille Ngonga Ngomo, Department of Computer Science, University of Leipzig, Leipzig, Germany
Daniel Olsher, Department of Computer Science, National University of Singapore, Singapore
Özgür Özçep, Department of Computer Science, TU Hamburg-Harburg, Hamburg, Germany
Tassilo Pellegrin, Semantic Web Company, Vienna, Austria
Jordà Polo, Barcelona Supercomputing Center (BSC), Technical University of Catalonia (UPC), Barcelona, Spain
Dheeraj Rajagopal, Department of Computer Science, National University of Singapore, Singapore
Nadia Rauch, Department of Systems and Informatics, University of Florence, Firenze, Italy
Riccardo Rosati, Department of Computer Science, Sapienza University of Rome, Rome, Italy
Pietro Ruiu, Istituto Superiore Mario Boella, Torino, Italy
Rudolf Schlatte, Department of Computer Science, University of Oslo, Oslo, Norway
Michael Schmidt, Fluid Operations AG, Walldorf, Germany
Mikhail Simonov, Advanced Computing and Electromagnetics Unit, Istituto Superiore Mario Boella, Torino, Italy
Ahmet Soylu, Department of Computer Science, University of Oslo, Oslo, Norway
Marcus Spies, Ludwig-Maximilians University, Munich, Germany
Vagan Terziyan, Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland
Olivier Terzo, Advanced Computing and Electromagnetics Unit, Istituto Superiore Mario Boella, Torino, Italy
Arild Waaler, Department of Computer Science, University of Oslo, Oslo, Norway
Gerald Weber, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Roberto V. Zicari, Department of Computer Science, Goethe University, Frankfurt, Germany
Section I

Introduction
1
Toward Evolving Knowledge Ecosystems for Big Data Understanding

Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez

Contents

Introduction
Motivation and Unsolved Issues
    Illustrative Example
    Demand in Industry
    Problems in Industry
    Major Issues
State of Technology, Research, and Development in Big Data Computing
    Big Data Processing—Technology Stack and Dimensions
    Big Data in European Research
    Complications and Overheads in Understanding Big Data
    Refining Big Data Semantics Layer for Balancing Efficiency and Effectiveness
        Focusing
        Filtering
        Forgetting
        Contextualizing
        Compressing
        Connecting
    Autonomic Big Data Computing
Scaling with a Traditional Database
    Large Scale Data Processing Workflows
Knowledge Self-Management and Refinement through Evolution
    Knowledge Organisms, their Environments, and Features
        Environment, Perception (Nutrition), and Mutagens
        Knowledge Genome and Knowledge Body
        Morphogenesis
        Mutation
        Recombination and Reproduction
    Populations of Knowledge Organisms
    Fitness of Knowledge Organisms and Related Ontologies
Some Conclusions
Acknowledgments
References

Introduction
Big Data is a phenomenon that leaves hardly any information professional indifferent these days. Remarkably, application demands and developments in the
context of related disciplines resulted in technologies that boosted data gen-
eration and storage at unprecedented scales in terms of volumes and rates. To
mention just a few facts reported by Manyika et al. (2011): a disk drive capable
of storing all the world’s music could be purchased for about US $600; 30 billion pieces of content are shared monthly on Facebook (facebook.com) alone. Exponential growth of data volumes is accelerated by a dramatic increase in social networking applications that allow nonspecialist users to create a huge amount of content easily and freely. Equipped with rapidly evolving mobile
devices, a user is becoming a nomadic gateway boosting the generation of
additional real-time sensor data. The emerging Internet of Things makes every thing a source of data or content, adding billions of additional artificial and autonomic sources of data to the overall picture. Smart spaces, where people,
devices, and their infrastructure are all loosely connected, also generate data
of unprecedented volumes and with velocities rarely observed before. An
expectation is that valuable information will be extracted out of all these data
to help improve the quality of life and make our world a better place.
Society is, however, left bewildered about how to use all these data effi-
ciently and effectively. For example, a topical estimate of the number of data-savvy managers needed to take full advantage of Big Data in the United States is 1.5 million (Manyika et al. 2011). A major challenge would be finding
a balance between the two evident facets of the whole Big Data adventure: (a)
the more data we have, the more potentially useful patterns it may include
and (b) the more data we have, the less the hope is that any machine-learn-
ing algorithm is capable of discovering these patterns in an acceptable time
frame. Perhaps because of this intrinsic conflict, many experts consider that Big Data brings not only one of the biggest challenges, but also one of the most exciting opportunities of the recent 10 years (cf. Fan et al. 2012b).
The avalanche of Big Data causes a conceptual divide in minds and opin-
ions. Enthusiasts claim that, faced with massive data, a scientific approach “. . .
hypothesize, model, test—is becoming obsolete. . . . Petabytes allow us to say:
‘Correlation is enough.’ We can stop looking for models. We can analyze the
data without hypotheses about what it might show. We can throw the numbers
into the biggest computing clusters the world has ever seen and let statistical
algorithms find patterns . . .” (Anderson 2008). Pessimists, however, point out

that Big Data provides “. . . destabilising amounts of knowledge and information that lack the regulating force of philosophy” (Berry 2011). Indeed, being
abnormally big does not yet mean being healthy and wealthy and should be
treated appropriately (Figure 1.1): a diet, exercise, medication, or even surgery
(philosophy). Those data sets, for which systematic health treatment is ignored
in favor of correlations, will die sooner—as useless. There is a hope, however,
that holistic integration of evolving algorithms, machines, and people rein-
forced by research effort across many domains will guarantee required fitness
of Big Data, assuring proper quality at right time (Joseph 2012).
Mined correlations, though very useful, may hint about an answer to a
“what,” but not “why” kind of questions. For example, if Big Data about
Royal guards and their habits had been collected in the 1700s’ France, one
could mine today that all musketeers who used to have red Burgundy regu-
larly for dinners have not survived till now. A pity, since red Burgundy was only one of many factors, and a very minor one. A scientific approach is needed to infer
real reasons—the work currently done predominantly by human analysts.
Effectiveness and efficiency are the evident keys in Big Data analysis.
Cradling the gems of knowledge extracted out of Big Data would only be
effective if: (i) not a single important fact is left in the burden—which means
completeness and (ii) these facts are faceted adequately for further infer-
ence—which means expressiveness and granularity. Efficiency may be inter-
preted as the ratio of spent effort to the utility of result. In Big Data analytics,
it could be straightforwardly mapped to timeliness. If a result is not timely,
its utility (Ermolayev et al. 2004) may go down to zero or even far below within seconds to milliseconds for some important industrial applications such as technological process control or air traffic control.
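This utility-to-timeliness relation can be made concrete with a toy decay model (the exponential form and the half-life parameter are our illustrative assumptions, not taken from the cited work):

```python
def timed_utility(base_utility, delay_s, half_life_s):
    """Utility of an analytic result that loses half of its value for every
    half_life_s seconds of delay; it tends to zero for very late results."""
    return base_utility * 0.5 ** (delay_s / half_life_s)

# A hypothetical control-room setting where the value of a result halves every 100 ms:
print(timed_utility(1.0, 0.0, 0.1))  # an immediate result keeps its full utility
print(timed_utility(1.0, 0.5, 0.1))  # half a second late: only about 3% of the value left
```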
Notably, increasing effectiveness means increasing the effort or making the
analysis computationally more complex, which negatively affects efficiency.

Figure 1.1
Evolution of data collections—dimensions (see also Figure 1.3) have to be treated with care.
(Courtesy of Vladimir Ermolayev.)

Finding a balanced solution with a sufficient degree of automation is the challenge that is not yet fully addressed by the research community.
One derivative problem concerns knowledge extracted out of Big Data as
the result of some analytical processing. In many cases, it may be expected
that the knowledge mechanistically extracted out of Big Data will also be
big. Therefore, taking care of Big Knowledge (which has more value than
the source data) would be at least of the same importance as resolving chal-
lenges associated with Big Data processing. Uplifting the problem to the
level of knowledge is inevitable and brings additional complications such as
resolving contradictory and changing opinions of everyone on everything.
Here, an adequate approach in managing the authority and reputation of
“experts” will play an important role (Weinberger 2012).
This chapter offers a possible approach in addressing the problem of
“understanding” Big Data in an effective and efficient way. The idea is mak-
ing adequately grained and expressive knowledge representations and fact
collections evolve naturally, triggered by new tokens of relevant data coming
along. Pursuing this way would also imply conceptual changes in the Big Data
Processing stack. A refined semantic layer has to be added to it for provid-
ing adequate interfaces to interlink horizontal layers and enable knowledge-
related functionality coordinated in top-down and bottom-up directions.
The remainder of the chapter is structured as follows. The “Motivation
and Unsolved Issues” section offers an illustrative example and the anal-
ysis of the demand for understanding Big Data. The “State of Technology,
Research, and Development in Big Data Computing” section reviews the
relevant research on using semantic and related technologies for Big Data
processing and outlines our approach to refine the processing stack. The
“Scaling with a Traditional Database” section focuses on how the basic
data storage and management layer could be refined in terms of scalability,
which is necessary for improving efficiency/effectiveness. The “Knowledge
Self-Management and Refinement through Evolution” section presents our
approach, inspired by the mechanisms of natural evolution studied in evo-
lutionary biology. We focus on a means of arranging the evolution of knowl-
edge, using knowledge organisms, their species, and populations with the
aim of balancing efficiency and effectiveness of processing Big Data and its
semantics. We also provide our preliminary considerations on assessing fit-
ness in an evolving knowledge ecosystem. Our conclusions are drawn in the
“Some Conclusions” section.

Motivation and Unsolved Issues


Practitioners, including systems engineers, Information Technology archi-
tects, Chief Information and Technology Officers, and data scientists, use

the phenomenon of Big Data in their dialog over means of improving sense-
making. The phenomenon remains a constructive way of introducing others,
including nontechnologists, to new approaches such as the Apache Hadoop
(hadoop.apache.org) framework. Apparently, Big Data is collected to be ana-
lyzed. “Fundamentally, big data analytics is a workflow that distills terabytes
of low-value data down to, in some cases, a single bit of high-value data. . . .
The goal is to see the big picture from the minutia of our digital lives” (cf.
Fisher et al. 2012). Evidently, “seeing the big picture” in its entirety is the key
and requires making Big Data healthy and understandable in terms of effec-
tiveness and efficiency for analytics.
In this section, the motivation for understanding the Big Data that improves
the performance of analytics is presented and analyzed. It begins with pre-
senting a simple example which is further used throughout the chapter. It
continues with the analysis of industrial demand for Big Data analytics. In
this context, the major problems as perceived by industries are analyzed and
informally mapped to unsolved technological issues.

Illustrative Example
Imagine a stock market analytics workflow inferring trends in share price
changes. One possible way of doing this is to extrapolate on stock price data.
However, a more robust approach could be extracting these trends from
market news. Hence, the incoming data for analysis would very likely be
several streams of news feeds resulting in a vast amount of tokens per day.
An illustrative example of such a news token is:

Posted: Tue, 03 Jul 2012 05:01:10-04:00


LONDON (Reuters)
U.S. planemaker Boeing hiked its 20-year market forecast, predicting
demand for 34,000 new aircraft worth $4.5 trillion, on growth in
emerging regions and as airlines seek efficient new planes to coun-
ter high fuel costs.*

Provided that an adequate technology is available,† one may extract the knowledge pictured as thick-bounded and gray-shaded elements in Figure 1.2.
This portion of extracted knowledge is quite shallow, as it simply inter-
prets the source text in a structured and logical way. Unfortunately, it does

* topix.com/aircraft/airbus-a380/2012/07/boeing-hikes-20-year-market-forecast (accessed July 5, 2012).
† The technologies for this are under intensive development currently, for example, wit.istc.cnr.it/stlab-tools/fred/ (accessed October 8, 2012).



[Figure: a terminological component relating Country, Company, Plane, PlaneMaker, MarketForecast, Airline, and EfficientNewPlane (low fuel consumption, built after 2009, with a delivery date), and individual assertions: AllNipponAirways (an Airline based in Japan) owns B787-JA812A (an EfficientNewPlane with 20% lower fuel consumption, delivered 2012/07/03), built by Boeing (a PlaneMaker based in the United States), whose New20YMarketForecastbyBoeing (SalesVolume = 4.5 trillion) hikes the Old20YMarketForecastbyBoeing.]

Figure 1.2
Semantics associated with a news data token.

not answer several important questions for revealing the motives for Boeing
to hike their market forecast:

Q1. What is an efficient new plane? How is efficiency related to high fuel costs to be countered?
Q2. Which airlines seek for efficient new planes? What are the emerg-
ing regions? How could their growth be assessed?
Q3. How are plane makers, airlines, and efficient new planes related to
each other?

In an attempt to answer these questions, a human analyst will exploit his commonsense knowledge and look around the context for additional
relevant evidence. He will likely find out that Q1 and Q3 could be answered
using commonsense statements acquired from a foundational ontology,
for example, CYC (Lenat 1995), as shown by dotted line bounded items in
Figure 1.2.
Answering Q2, however, requires looking for additional information like:
the fleet list of All Nippon Airways* who was the first to buy B787 airplanes

* For example, at airfleets.net/flottecie/All%20Nippon%20Airways-active-b787.htm (accessed July 5, 2012).

from Boeing (the rest of Figure 1.2); and a relevant list of emerging regions
and growth factors (not shown in Figure 1.2). The challenge for a human analyst in performing the task is the low speed of manual data analysis. The available time slot for providing his recommendation is too small, given the effort to be spent per news token for deep knowledge extraction. This is one good reason for the growing demand for industrial strength technologies to assist in analytical work on Big Data, increase quality, and reduce related efforts.
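As a toy illustration of the shallow extraction step discussed above, the following sketch pulls a few structured facts out of the news token with hand-written patterns (the patterns and fact names are our own illustration; the extraction technologies referenced earlier use far richer linguistic analysis):

```python
import re

# The news token from the example above.
TOKEN = ("U.S. planemaker Boeing hiked its 20-year market forecast, "
         "predicting demand for 34,000 new aircraft worth $4.5 trillion")

def extract_forecast(text):
    """Return a dict of shallow facts found in a forecast news token."""
    facts = {}
    m = re.search(r"planemaker (\w+) hiked its ([\w-]+) market forecast", text)
    if m:
        facts["planeMaker"] = m.group(1)
        facts["forecastHorizon"] = m.group(2)
    m = re.search(r"worth \$([\d.]+) (\w+)", text)
    if m:
        facts["salesVolume"] = f"{m.group(1)} {m.group(2)}"
    return facts

print(extract_forecast(TOKEN))
# e.g. {'planeMaker': 'Boeing', 'forecastHorizon': '20-year', 'salesVolume': '4.5 trillion'}
```

Such shallow facts correspond to the gray-shaded part of Figure 1.2; answering the deeper questions Q1–Q3 still requires the background knowledge discussed above.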

Demand in Industry
Turning available Big Data assets into action and performance is considered
a deciding factor by today’s business analytics. For example, the report by
Capgemini (2012) concludes, based on a survey of the interviews with more
than 600 business executives, that Big Data use is highly demanded in indus-
tries. Interviewees firmly believe that their companies’ competitiveness and
performance strongly depend on the effective and efficient use of Big Data.
In particular, on average,

• Big Data is already used for decision support 58% of the time, and
29% of the time for decision automation
• It is believed that the use of Big Data will improve organizational
performance by 41% over the next three years

The report by Capgemini (2012) also summarizes that the following are the
perceived benefits of harnessing Big Data for decision-making:

• More complete understanding of market conditions and evolving business trends
• Better business investment decisions
• More accurate and precise responses to customer needs
• Consistency of decision-making and greater group participation in
shared decisions
• Focusing resources more efficiently for optimal returns
• Faster business growth
• Competitive advantage (new data-driven services)
• Common basis for evaluation (one true starting point)
• Better risk management

Problems in Industry
Though the majority of business executives firmly believe in the utility
of Big Data and analytics, doubts still persist about its proper use and the
availability of appropriate technologies. As a consequence, “We no longer

speak of the Knowledge Economy or the Information Society. It’s all data
now: Data Economy and Data Society. This is a confession that we are no
longer in control of the knowledge contained in the data our systems col-
lect” (Greller 2012).
Capgemini (2012) outlines the following problems reported by their
interviewees:

• Unstructured data are hard to process at scale. Forty-two percent of respondents state that unstructured content is too difficult to interpret. Forty percent of respondents believe that they have too much unstructured data to support decision-making.
• Fragmentation is a substantial obstacle. Fifty-six percent of respondents
across all sectors consider organizational silos the biggest impedi-
ment to effective decision-making using Big Data.
• Effectiveness needs to be balanced with efficiency in “cooking” Big Data.
Eighty-five percent of respondents say the major problem is the lack
of effective ability to analyze and act on data in real time.

The last conclusion by Capgemini is also supported by Bowker (2005, pp. 183–184) who suggests that “raw data is both an oxymoron and a bad idea;
to the contrary, data should be cooked with care.” This argument is further
detailed by Bollier (2010, p. 13) who stresses that Big Data is a huge “mass of
raw information.” It needs to be added that this “huge mass” may change in
time with varying velocity, is also noisy, and cannot be considered as self-
explanatory. Hence, an answer to the question whether Big Data indeed rep-
resent a “ground truth” becomes very important—opening pathways to all
sorts of philosophical and pragmatic discussions. One aspect of particular
importance is interpretation that defines the ways of cleaning Big Data. Those
ways are straightforwardly biased because any interpretation is subjective.
As observed, old problems of data processing that are well known for
decades in industry are made even sharper when data becomes Big. Boyd
and Crawford (2012) point out several aspects to pay attention to while
“cooking” Big Data, hinting that industrial strength technologies for that are
not yet in place:

• Big Data changes the way knowledge is acquired and even defined. As
already mentioned above (cf. Anderson 2008), correlations mined
from Big Data may hint about model changes and knowledge repre-
sentation updates and refinements. This may require conceptually
novel solutions for evolving knowledge representation, reasoning,
and management.
• Having Big Data does not yet imply objectivity, or accuracy, on time. Here,
the clinch between efficiency and effectiveness of Big Data inter-
pretation and processing is one of the important factors. Selecting

a sample of an appropriate size for being effective may bring bias, harm correctness, and accuracy. Otherwise, analyzing Big Data in
source volumes will definitely distort timeliness.
• Therefore, Big Data is not always the best option. A question that requires
research effort in this context is about the appropriate sample, size,
and granularity to best answer the question of a data analyst.
• Consequently, taken off-context Big Data is meaningless in interpreta-
tion. Indeed, choosing an appropriate sample and granularity may
be seen as contextualization—circumscribing (Ermolayev et  al.
2010) the part of data which is potentially the best-fitted sample for
the analytical query. Managing context and contextualization for
Big Data at scale is a typical problem and is perceived as one of the
research and development challenges.

One more aspect having indirect relevance to technology, but important in terms of socio-psychological perceptions and impact on industries, is ethics and the Big Data divide. Ethics is concerned with legal regulations and constraints of allowing a Big Data collector to interpret personal or company
information without informing the subjects about it. Ethical issues become
sharper when data are used for competition and lead to the emergence of, and separation into, the Big Data rich and poor, implied by accessibility to data sources at the required scale.

Major Issues
Applying Big Data analytics faces different issues related to the characteristics of data, the analysis process, and also social concerns. Privacy is a very
sensitive issue and has conceptual, legal, and technological implications.
This concern gains additional importance in the context of Big Data. Privacy
is defined by the International Telecommunications Union as the “right of
individuals to control or influence what information related to them may
be disclosed” (Gordon 2005). Personal records of individuals are increas-
ingly being collected by several government and corporate organizations.
These records are usually used for the purpose of data analytics. To facilitate
data analytics, such organizations publish “appropriately private” views
over the collected data. However, privacy is a double-edged sword—there
should be enough privacy to ensure that sensitive information about the
individuals is not disclosed and at the same time there should be enough
data to perform the data analysis. Thus, privacy is a primary concern that
has widespread implications for someone desiring to explore the use of Big
Data for development in terms of data acquisition, storage, preservation,
presentation, and use.
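A minimal sketch of how such an “appropriately private” view can be published, in the spirit of k-anonymity-style generalization (the fields, the coarsening rules, and k are our own illustrative choices, not prescribed by the text):

```python
def generalize(records, k=2):
    """Publish a view that coarsens age and zip code, and suppresses any
    generalized group that still holds fewer than k records."""
    view = [{"age": f"{r['age'] // 10 * 10}-{r['age'] // 10 * 10 + 9}",
             "zip": r["zip"][:3] + "**",
             "diagnosis": r["diagnosis"]} for r in records]
    counts = {}
    for v in view:
        key = (v["age"], v["zip"])
        counts[key] = counts.get(key, 0) + 1
    return [v for v in view if counts[(v["age"], v["zip"])] >= k]

patients = [{"age": 34, "zip": "53715", "diagnosis": "flu"},
            {"age": 38, "zip": "53710", "diagnosis": "cold"},
            {"age": 62, "zip": "02139", "diagnosis": "flu"}]
print(generalize(patients))  # only the two 30-39 / 537** records survive
```

The double-edged-sword trade-off is visible directly: a larger k suppresses more records (more privacy, less data for analysis), a smaller k the reverse.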
Another concern is the access and sharing of information. Usually private
organizations and other institutions are reluctant to share data about their

clients and users, as well as about their own operations. Barriers may include
legal considerations, a need to protect their competitiveness, a culture of con-
fidentiality, and, largely, the lack of the right incentive and information struc-
tures. There are also institutional and technical issues, when data are stored
in places and ways that make them difficult to be accessed and transferred.
One significant issue is to rethink security for information sharing in Big
Data use cases. Several online services allow us to share private information (e.g., facebook.com, geni.com, linkedin.com), but beyond record-level
access control we do not comprehend what it means to share data and how
the shared data can be linked.
Managing large and rapidly increasing volumes of data has been a chal-
lenging issue. Earlier, this issue was mitigated by processors getting faster,
which provided us with the resources needed to cope with increasing vol-
umes of data. However, there is a fundamental shift underway considering
that data volume is scaling faster than computer resources. Consequently,
making sense of data at the required scale is far beyond human capability.
So, we, the humans, increasingly “. . . require the help of automated systems
to make sense of the data produced by other (automated) systems” (Greller
2012). These instruments produce new data at comparable scale—kick-start-
ing a new iteration in this endless cycle.
In general, given a large data set, it is often necessary to find elements
in it that meet a certain criterion which likely occurs repeatedly. Scanning
the entire data set to find suitable elements is obviously impractical. Instead,
index structures are created in advance to permit finding the qualifying ele-
ments quickly.
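The difference between scanning and consulting a prebuilt index can be sketched as follows (an in-memory inverted index on one attribute; real systems maintain disk-based B-tree or hash indexes, and the data set here is a synthetic example of our own):

```python
from collections import defaultdict

# A synthetic data set of 100,000 records with a 'color' attribute.
records = [{"id": i, "color": c}
           for i, c in enumerate(["red", "blue", "red", "green"] * 25_000)]

def scan(color):
    """Full scan: touches every record on every query."""
    return [r["id"] for r in records if r["color"] == color]

# Index built once, in advance; each query is then a single dict lookup.
index = defaultdict(list)
for r in records:
    index[r["color"]].append(r["id"])

def lookup(color):
    return index[color]

assert lookup("green") == scan("green")  # same answer, far fewer records touched
```

The index trades extra storage and build time for repeated queries that no longer depend on the size of the whole data set.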
Moreover, dealing with new data sources brings a significant number of
analytical issues. The relevance of these issues will vary depending on the
type of analysis being conducted and on the type of decisions that the data
might ultimately inform. The big core issue is to analyze what the data are
really telling us in an entirely transparent manner.

State of Technology, Research, and Development in Big Data Computing
After giving an overview of the influence of Big Data on industries and
society as a phenomenon and outlining the problems in Big Data computing
context as perceived by technology consumers, we now proceed with the
analysis of the state of development of those technologies. We begin with
presenting the overall Big Data Processing technology stack and point out
how different dimensions of Big Data affect the requirements to technolo-
gies, having understanding—in particular, semantics-based processing—
as a primary focus. We continue with presenting a selection of Big Data

research and development projects and focus on what they do in advancing the state-of-the-art in semantic technologies for Big Data processing.
Further, we summarize the analysis by pointing out the observed complica-
tions and overheads in processing Big Data semantics. Finally, we outline a
high-level proposal for the refinement of the Big Data semantics layer in the
technology stack.

Big Data Processing—Technology Stack and Dimensions


At a high level of detail, Driscoll (2011) describes the Big Data processing
technology stack comprising three major layers: foundational, analytics, and
applications (upper part of Figure 1.3).
The foundational layer provides the infrastructure for storage, access, and
management of Big Data. Depending on the nature of data, stream process-
ing solutions (Abadi et  al. 2003; Golab and Tamer Ozsu 2003; Salehi 2010),
distributed persistent storage (Chang et al. 2008; Roy et al. 2009; Shvachko
et  al. 2010), cloud infrastructures (Rimal et  al. 2009; Tsangaris et  al. 2009;
Cusumano 2010), or a reasonable combination of these (Gu and Grossman
2009; He et al. 2010; Sakr et al. 2011) may be used for storing and accessing
data in response to the upper-layer requests and requirements.

[Figure: the Big Data processing stack (focused services, on top of Big Data analytics, on top of the Big Data storage, access, and management infrastructure with its data stream processing, storage, query planning and execution, and data management components), annotated with the four dimensions: Volume and Velocity relate to efficiency; Variety and Complexity relate to effectiveness.]

Figure 1.3
Processing stack, based on Driscoll (2011), and the four dimensions of Big Data, based on Beyer et al. (2011), influencing efficiency and effectiveness of analytics.

The middle layer of the stack is responsible for analytics. Here data ware-
housing technologies (e.g., Nemani and Konda 2009; Ponniah 2010; Thusoo
et  al. 2010) are currently exploited for extracting correlations and features
(e.g., Ishai et  al. 2009) from data and feeding classification and prediction
algorithms (e.g., Mills 2011).
Focused applications or services are at the top of the stack. Their func-
tionality is based on the use of more generic lower-layer technologies and
exposed to end users as Big Data products.
An example of a startup offering focused services is BillGuard (billguard.
com). It monitors customers’ credit card statements for dubious charges and
even leverages the collective behavior of users to improve its fraud predic-
tions. Another company called Klout (klout.com/home) provides a genuine
data service that uses social media activity to measure online influence.
LinkedIn’s People you may know feature is also a kind of focused service. This
service is presumably based on graph theory, starting exploration of the
graph of your relations from your node and filtering those relations accord-
ing to what is called “homophily.” The greater the homophily between two
nodes, the more likely two nodes will be connected.
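A People you may know-style suggestion can be sketched as ranking friend-of-friend candidates by a homophily proxy, here simply the number of mutual connections (our own simplification for illustration; the graph and scoring are not LinkedIn's actual method):

```python
def suggest(graph, user, top=3):
    """Rank non-neighbors reachable in two hops by mutual-connection count."""
    direct = graph[user]
    scores = {}
    for friend in direct:
        for fof in graph[friend]:
            if fof != user and fof not in direct:
                scores[fof] = scores.get(fof, 0) + 1  # each shared neighbor raises homophily
    return sorted(scores, key=scores.get, reverse=True)[:top]

graph = {"ann": {"bob", "cat"}, "bob": {"ann", "cat", "dan"},
         "cat": {"ann", "bob", "dan"}, "dan": {"bob", "cat"}}
print(suggest(graph, "ann"))  # ['dan'] -- two mutual connections with ann
```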
According to its purpose, the foundational layer is concerned with being capable of processing as much data as possible (volume) and as soon as possible. In particular, if streaming data are used, the faster the stream is (veloc-
ity), the more difficult it is to process the data in a stream window. Currently
available technologies and tools for the foundational level are not equally
well coping with volume and velocity dimensions which are, so to say, anti-
correlated due to their nature. Therefore, hybrid infrastructures are in use
for balancing processing efficiency aspects (Figure 1.3)—comprising solu-
tions focused on taking care of volumes, and, separately, of velocity. Some
examples are given in “Big Data in European Research” section.
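The pressure that velocity puts on stream processing can be illustrated with a toy sliding-window aggregator: every arriving item must be absorbed, and expired items evicted, in near-constant time, or the window falls behind the stream (the window span and the averaged metric are our own example):

```python
from collections import deque

class WindowAverage:
    """Keep a running average over the last `span` seconds of a stream."""
    def __init__(self, span):
        self.span, self.items, self.total = span, deque(), 0.0

    def push(self, timestamp, value):
        self.items.append((timestamp, value))
        self.total += value
        # Evict readings that have slid out of the window.
        while self.items and self.items[0][0] <= timestamp - self.span:
            _, old = self.items.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.items) if self.items else 0.0

w = WindowAverage(span=10)
for t, v in [(0, 4.0), (3, 8.0), (12, 6.0)]:  # at t=12 the t=0 reading expires
    w.push(t, v)
print(w.average())  # mean of the readings still inside the window
```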
For the analytics layer (Figure 1.3), volume and velocity dimensions (Beyer
et al. 2011) are also important and constitute the facet of efficiency—big vol-
umes of data which may change swiftly have to be processed in a timely
fashion. However, two more dimensions of Big Data become important—
complexity and variety—which form the facet of effectiveness. Complexity
is clearly about the adequacy of data representations and descriptions for
analysis. Variety describes a degree of syntactic and semantic heterogeneity
in distributed modules of data that need to be integrated or harmonized for
analysis. A major conceptual complication for analytics is that efficiency is
anticorrelated to effectiveness.

Big Data in European Research


Due to its huge demand, Big Data Computing is currently a hyped field of research and development, producing a vast body of work. To keep the size of this review manageable for a reader, we focus on the batch of the
running 7th Framework Programme (FP7) Information and Communication

Technology (ICT; cordis.europa.eu/fp7/ict/) projects within this vibrant


field. Big Data processing, including semantics, is addressed by the strate-
gic objective of Intelligent Information Management (IIM; cordis.europa.eu/
fp7/ict/content-knowledge/projects_en.html). IIM projects funded in frame
of FP7 ICT Call 5 are listed in Table 1.1 and further analyzed below.
SmartVortex [Integrating Project (IP); smartvortex.eu] develops a techno-
logical infrastructure—a comprehensive suite of interoperable tools, ser-
vices, and methods—for intelligent management and analysis of massive
data streams. The goal is to achieve better collaboration and decision-
making in large-scale collaborative projects concerning industrial innova-
tion engineering.
Legend for Table 1.1: AEP, action extraction and prediction; DLi, data linking; DM, data mining; DS, diversity in semantics; DV, domain vocabulary; FCA, formal concept analysis; IE, information extraction; Int, integration; KD, knowledge discovery; M-LS, multi-lingual search; MT, machine translation; O, ontology; OM, opinion mining; QL, query language; R, reasoning; SBI, business intelligence over semantic data; SDW, semantic data warehouse (triple store); SUM, summarization.
LOD2 (IP; lod2.eu) claims to deliver: industrial strength tools and meth-
odologies for exposing and managing very large amounts of structured
information; a bootstrap network of multidomain and multilingual ontolo-
gies from sources such as Wikipedia (wikipedia.org) and OpenStreetMap
(openstreetmap.org); machine learning algorithms for enriching, repairing,
interlinking, and fusing data from Web resources; standards and methods for
tracking provenance, ensuring privacy and data security, assessing informa-
tion quality; adaptive tools for searching, browsing, and authoring Linked
Data.
Tridec (IP; tridec-online.eu) develops a service platform accompanied by next-generation work environments supporting human experts in decision processes for managing and mitigating emergency situations triggered
by the earth (observation) system in complex and time-critical settings. The
platform enables “smart” management of collected sensor data and facts
inferred from these data with respect to crisis situations.
First [Small Targeted Research Project (STREP); project-first.eu] develops
an information extraction, integration, and decision-making infrastructure
for the financial domain with extremely large, dynamic, and heterogeneous
sources of information.
iProd (STREP; iprod-project.eu) investigates approaches to reducing product development costs through the efficient use of large amounts of data, including the development of a software framework to support complex information management. Key aspects addressed by the project are handling heterogeneous information and semantic diversity using semantic technologies, including knowledge bases and reasoning.
Teleios (STREP; earthobservatory.eu) focuses on elaborating a data
model and query language for Earth Observation (EO) images. Based on
Table 1.1
FP7 ICT Call 5 Projects and their Contributions to Big Data Processing and Understanding

Columns: Acronym; IIM Cluster(a); Contribution to Coping with Big Data dimensions(b): Volume, Velocity, Variety, Complexity; Domain(s)/Industry(ies); Contribution to Big Data Processing Stack layers(c); Contribution to Big Data Understanding
SmartVortex X X X Industrial innovation engineering X X X X
LOD2 X X Media and publishing, corporate data X X X X X O, ML
intranets, eGovernment
Tridec X X Crisis/emergency response, government, X X X X X R
oil and gas
First X X Market surveillance, investment X X X X X IE
management, online retail banking and
brokerage
iProd X X Manufacturing: aerospace, automotive, X X X X R, Int
and home appliances
Teleios X X Civil defense, environmental agencies. X X X X DM, QL, KD
Use cases: a virtual observatory for
TerraSAR-X data; real-time fire
monitoring
Khresmoi X X Medical imaging in healthcare, X X X X X IE, DLi, M-LS,
biomedicine MT
Robust X X Online communities (internet, extranet X X X AEP
and intranet) addressing: customer
support; knowledge sharing; hosting
services
Digital.me X X Personal sphere X X
Fish4Knowledge X X Marine sciences, environment X X X DV (fish), SUM
Render X X Information management (wiki), news X X X DS
aggregation (search engine), customer
relationship management
(telecommunications)
PlanetData X Cross-domain X
LATC X Government X X X
Advance X Logistics X X X X
Cubist X Market intelligence, computational X X X SDW, SBI, FCA
biology, control centre operations
Promise X Cross-domain X X X X
Dicode X Clinico-genomic research, healthcare, X X X X DM, OM
marketing
(a) IIM clustering information has been taken from the Commission’s source cordis.europa.eu/fp7/ict/content-knowledge/projects_en.html.
(b) As per the Gartner report on extreme information management (Gartner 2011).
(c) The contributions of the projects to the developments in the Big Data Stack layers have been assessed based on their public deliverables.
these, a scalable and adaptive environment for knowledge discovery from EO images and geospatial data sets, and a query processing and optimization technique for queries over multidimensional arrays and EO image annotations, are developed and implemented on top of the MonetDB (monetdb.org) system.
Khresmoi (IP; khresmoi.eu) develops an advanced multilingual and multi-
modal search and access system for biomedical information and documents.
The advances of Khresmoi comprise: automated information extraction from biomedical documents, reinforced by crowdsourcing, active learning, and automated estimation of trust level and target user expertise; automated analysis and indexing of 2D, 3D, and 4D medical images; linking information extracted from unstructured or semistructured biomedical
texts and images to structured information in knowledge bases; multilin-
gual search including multiple languages in queries and machine-translated
pertinent excerpts; visual user interfaces to assist in formulating queries and
displaying search results.
Robust (IP; robust-project.eu) investigates models and methods for describ-
ing, understanding, and managing the users, groups, behaviors, and needs
of online communities. The project develops a scalable cloud and stream-
based data management infrastructure for handling the real-time analysis of
large volumes of community data. Understanding and prediction of actions
are envisioned using simulation and visualization services. All the developed
tools are combined under the umbrella of the risk management framework,
resulting in the methodology for the detection, tracking, and management of
opportunities and threats to online community prosperity.
Digital.me (STREP; dime-project.eu) integrates all personal data in a per-
sonal sphere at a single user-controlled point of access—a user-controlled
personal service for intelligent personal information management. The software is targeted at integrating social web systems and communities and
implements decentralized communication to avoid external data storage and
undesired data disclosure.
Fish4Knowledge (STREP; homepages.inf.ed.ac.uk/rbf/Fish4Knowledge/)
develops methods for information abstraction and storage that reduce the amount of video data at a rate of 10 × 10^15 pixels to 10 × 10^12 units of information. The project also develops machine- and human-accessible vocabularies
for describing fish. The framework also comprises flexible data-processing
architecture and a specialized query system tailored to the domain. To
achieve these, the project exploits a combination of computer vision, video
summarization, database storage, scientific workflow, and human–computer
interaction methods.
Render (STREP; render-project.eu) is focused on investigating the aspect
of diversity of Big Data semantics. It investigates methods and techniques,
develops software, and collects data sets that will leverage diversity as
a source of innovation and creativity. The project also claims to provide enhanced support for feasibly managing data on a very large scale and for designing novel algorithms that reflect diversity in the ways information is selected, ranked, aggregated, presented, and used.
PlanetData [Network of Excellence (NoE); planet-data.eu] works toward
establishing a sustainable European community of researchers that supports
organizations in exposing their data in new and useful ways and develops
technologies that are able to handle data purposefully at scale. The network
also facilitates researchers’ exchange, training, and mentoring, and event
organization based substantially on an open partnership scheme.
LATC (Support Action; latc-project.eu) creates an in-depth test-bed for data-
intensive applications by publishing data sets produced by the European
Commission, the European Parliament, and other European institutions as
Linked Data on the Web and by interlinking them with other governmental
data.
Advance (STREP; advance-logistics.eu) develops a decision support plat-
form for improving strategies in logistics operations. The platform is based
on the refinement of predictive analysis techniques to process massive data
sets for long-term planning and cope with huge amounts of new data in real
time.
Cubist (STREP; cubist-project.eu) elaborates methodologies and imple-
ments a platform that brings together several essential features of Semantic
Technologies and Business Intelligence (BI): support for the federation of
data coming from unstructured and structured sources; a BI-enabled triple
store as a data persistency layer; data volume reduction and preprocessing
using data semantics; enabling BI operations over semantic data; a semantic
data warehouse implementing FCA; applying visual analytics for rendering,
navigating, and querying data.
Promise (NoE; promise-noe.eu) establishes a virtual laboratory for conduct-
ing participative research and experimentation to carry out, advance, and
bring automation into the evaluation and benchmarking of multilingual and
multimedia information systems. The project offers the infrastructure for
access, curation, preservation, re-use, analysis, visualization, and mining of
the collected experimental data.
Dicode (STREP; dicode-project.eu) develops a workbench of interoperable
services, in particular, for: (i) scalable text and opinion mining; (ii) collabora-
tion support; and (iii) decision-making support. The workbench is designed
to reduce data intensiveness and complexity overload at critical decision
points to a manageable level. It is envisioned that the use of the workbench
will help stakeholders to be more productive and concentrate on creative
activities.
In summary, the contributions to Big Data understanding of all the proj-
ects mentioned above result in the provision of different functionality for a
semantic layer—an interface between the Data and Analytics layers of the
Big Data processing stack—as pictured in Figure 1.4.
Figure 1.4
Contribution of the selection of FP7 ICT projects to technologies for Big Data understanding. The figure places the contributed techniques on the Big Data processing stack: focused services (MT, SBI); Big Data analytics; Big Data semantics, comprising query formulation/transformation (QL, ML-S), reasoning (R, AEP, OM), extraction/elicitation (AE, DE, IE, KD, FCA, SUM, DS), integration (DLi, Int), representation (O, DV), and storage (SDW); and Big Data storage, access, and management. Abbreviations are explained in the legend to Table 1.1.

However, these advancements remain somewhat insufficient in terms of reaching a desired balance between efficiency and effectiveness, as outlined in the introduction of this chapter. Analysis of Table 1.1 reveals that none of the reviewed projects addresses all four dimensions of Big Data in a balanced manner. In particular, only two projects—Tridec and First—claim contributions addressing Big Data velocity and variety-complexity. This fact points out that the clinch between efficiency and effectiveness in Big Data processing still remains a challenge.

Complications and Overheads in Understanding Big Data


As observed, mankind collects and stores data through generations, without a clear account of the utility of these data. Out of the data at hand, each generation extracts a relatively small proportion of knowledge for its everyday needs. The knowledge is produced by a generation for its needs—to the extent that it has to satisfy its “nutrition” requirement for supporting decision-making. Hence, knowledge is “food” for data analytics. An optimistic assumption usually made here is that the next generation will succeed in advancing tools for data mining, knowledge discovery, and
extraction. So the data that the current generation cannot process effectively and efficiently are left as a legacy for the next generation, in the hope that the descendants will cope better. The truth, however, is that the development of data- and knowledge-processing tools fails to keep pace with the explosive growth of data in all four dimensions mentioned above. Suspending the understanding of Big Data until an advanced next-generation capability is at hand is therefore an illusion of a solution.
Do today’s state-of-the-art technologies allow us to understand Big Data with an attempt to balance effectiveness and efficiency?—probably not.
Our brief analysis reveals that Big Data computing is currently developed
toward more effective versus efficient use of semantics. It is done by add-
ing the semantics layer to the processing stack (cf. Figures 1.3 and 1.4) with
an objective of processing all the available data and using all the generated
knowledge. Perhaps, the major issue is the attempt to eat all we have on the
table. Following the metaphor of “nutrition,” it has to be noted that the “food”
needs to be “healthy” in terms of all the discussed dimensions of Big Data.
Our perceptions of the consequences of being not selective with respect to
consuming data for understanding are as follows.
The major problem is the introduction of a new interface per se and in an
improper way. The advent of semantic technologies aimed at breaking down
data silos and simultaneously enabling efficient knowledge management at
scale. Assuming that databases describe data using multiple heterogeneous
labels, one might expect that annotating these labels using ontology ele-
ments as semantic tags enables virtual integration and provides immedi-
ate benefits for search, retrieval, reasoning, etc. without a need to modify
existing code, or data. Unfortunately, as noticed by Smith (2012), it is now
too easy to create “ontologies.” As a consequence, myriads of them are being
created in ad hoc ways and with no respect to compatibility, which implies the creation of new semantic silos and, further, brings something like a “Big Ontology” challenge onto the agenda. According to Smith (2012), the big reason is the lack of a rational (monetary) incentive for investing in reuse. Therefore, it is often accepted that a new “ontology” is developed for each new project. Harmonization is left as someone else’s work—for the next generation. Thus, the more successful semantic technology is at simplifying ontology creation, the more we fail to achieve our goals for interoperability and integration (Smith 2012).
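For illustration, the virtual-integration idea can be sketched in a few lines of Python: two sources label the same attribute differently, both labels are annotated with one ontology element, and a single ontology-level query then reaches both silos without modifying the sources. All identifiers (the onto: terms, the source and column names) are invented for the example.

```python
# Sketch of virtual integration via semantic tags: heterogeneous source
# labels are annotated with a shared ontology element, so one query posed
# against the ontology covers both data silos without changing them.
# All identifiers below are invented for the example.

annotations = {
    ("crm_db", "cust_surname"): "onto:FamilyName",
    ("hr_db", "last_name"): "onto:FamilyName",
    ("hr_db", "dept"): "onto:Department",
}

sources = {
    ("crm_db", "cust_surname"): ["Smith", "Jones"],
    ("hr_db", "last_name"): ["Brown"],
    ("hr_db", "dept"): ["Sales"],
}

def query_by_ontology_term(term):
    """Collect values from every source column tagged with `term`."""
    results = []
    for column, tag in annotations.items():
        if tag == term:
            results.extend(sources[column])
    return sorted(results)

print(query_by_ontology_term("onto:FamilyName"))  # ['Brown', 'Jones', 'Smith']
```

The fragility Smith (2012) points at appears as soon as a second, incompatible ontology tags the same columns: the query above would then have to be repeated per ontology, recreating the silos at the semantic level.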
It is worth noting here that there is still a way to start doing things cor-
rectly which, according to Smith (2012), would be “to create an incremental,
evolutionary process, where what is good survives, and what is bad fails;
create a scenario in which people will find it profitable to reuse ontologies,
terminologies and coding systems which have been tried and tested; silo
effects will be avoided and results of investment in Semantic Technology
will cumulate effectively.”
A good example of a collaborative effort going in this correct direction
is the approach used by the Gene Ontology initiative (geneontology.org)
which follows the principles of the OBO Foundry (obofoundry.org). The
Gene Ontology project is a major bioinformatics initiative with the aim of
standardizing the representation of gene and gene product attributes across
species and databases. The project provides a controlled vocabulary of terms
for describing gene product characteristics and gene product annotation
data, as well as tools to access and process this data. The mission of OBO
Foundry is to support community members in developing and publishing
fully interoperable ontologies in the biomedical domain, following a common evolving design philosophy and implementation, and ensuring a gradual improvement of the quality of ontologies.
Furthermore, adding a data semantics layer facilitates increasing effec-
tiveness in understanding Big Data, but also substantially increases the
computational overhead for processing the representations of knowledge—decreasing efficiency. A solution is needed that harmoniously and rationally balances the increase in the adequacy and completeness of Big Data semantics, on the one hand, against the increase in computational complexity, on the other. A straightforward approach is
using scalable infrastructures for processing knowledge representations.
A vast body of related work focuses on elaborating this approach (e.g.,
Broekstra et al. 2002; Wielemaker et al. 2003; Cai and Frank 2004; DeCandia
et al. 2007).
The reasons for qualifying this approach only as a mechanistic solution are:

• Using distributed scalable infrastructures, such as clouds or grids, implies new implementation problems and computational overheads.
• Typical tasks for processing knowledge representations, such as reasoning, alignment, query formulation and transformation, etc., scale poorly (e.g., Oren et al. 2009; Urbani et al. 2009; Hogan et al. 2011)—more expressiveness implies harder problems in decoupling the fragments for distribution. Nontrivial optimization, approximation, or load-balancing techniques are required.

Another effective approach to balancing complexity and timeliness is maintaining history, or learning from the past. A simple but topical example in
data processing is the use of previously acquired information for saving
approximately 50% of comparison operations in sorting by selection (Knuth
1998, p. 141). In Distributed Artificial Intelligence software, agent architec-
tures maintaining their states or history for more efficient and effective
deliberation have also been developed (cf. Dickinson and Wooldridge 2003).
In Knowledge Representation and Reasoning, maintaining history is often
implemented as inference or query result materialization (cf. Kontchakov
et al. 2010; McGlothlin and Khan 2010), which also do not scale well up to the
volumes characterizing real Big Data.
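In its simplest form, result materialization is memoization: an answered query is stored so that repeated queries are served without re-inference. The toy "inference" step below is an invented stand-in for an expensive reasoning procedure.

```python
# Sketch: query result materialization as memoization. The "inference"
# function is an invented stand-in for an expensive reasoning step; real
# materialization stores inferred triples or query answers persistently.
CALLS = {"count": 0}

def infer(query):
    CALLS["count"] += 1            # stands for an expensive reasoning step
    return query.upper()

materialized = {}

def answer(query):
    if query not in materialized:  # first time: infer and materialize
        materialized[query] = infer(query)
    return materialized[query]     # repeats are served from the store

answer("ex:hasFleet ?x")
answer("ex:hasFleet ?x")           # second call does not trigger inference
```

The scalability concern raised above shows up in the `materialized` store itself: at Big Data volumes, keeping all answered queries materialized becomes its own storage and maintenance problem.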
Yet another way to find a proper balance is exploiting incomplete or
approximate methods. These methods yield results of acceptable quality
much faster than approaches aiming at building fully complete or exact, that
is, ideal results. Good examples of technologies for incomplete or partial rea-
soning and approximate query answering (Fensel et al. 2008) are elaborated
in the FP7 LarKC project (larkc.eu). Remarkably, some of the approximate
querying techniques, for example, Guéret et al. (2008), are based on evolu-
tionary computing.
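As a generic illustration of this trade-off, the sketch below answers an aggregate query approximately by uniform sampling, trading a small, probabilistically bounded error for a large reduction in the data scanned. This is not the LarKC or Guéret et al. technique; the sample size and data are invented for the example.

```python
import random

# Sketch: approximate query answering by uniform sampling, trading a small,
# probabilistically bounded error for a large reduction in scanned data.
# A generic illustration, not the LarKC method; sample size and data are
# invented for the example.

def approx_mean(data, sample_size, seed=42):
    rng = random.Random(seed)                  # fixed seed for repeatability
    sample = rng.sample(data, min(sample_size, len(data)))
    return sum(sample) / len(sample)

data = list(range(1_000_000))                  # exact mean: 499999.5
estimate = approx_mean(data, sample_size=10_000)
```

Scanning 10,000 of 1,000,000 items yields an estimate whose expected relative error is well under a percent, which is often "acceptable quality" in the sense discussed above.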
Refining Big Data Semantics Layer for Balancing Efficiency and Effectiveness


As one may notice, the developments in the Big Data semantics layer are
mainly focused on posing and appropriately transforming the semantics of
queries all the way down to the available data, using networked ontologies.
At least two shortcomings of this, in fact, unidirectional* approach need to
be identified:

1. Scalability overhead implies insufficient efficiency. Indeed, executing queries at the data layer implies processing volumes at the scale of
stored data. Additional overhead is caused by the query transforma-
tion, distribution, and planning interfaces. Lifting up the stack and
fusing the results of these queries also imply similar computational
overheads. A possible solution for this problem may be sought fol-
lowing a supposition that the volume of knowledge describing data
adequately for further analyses is substantially smaller than the vol-
ume of this data. Hence, down-lifting queries for execution need to
be stopped at the layer of knowledge storage for better efficiency.
However, the knowledge should be consistent enough with the data
so that it can fulfill completeness and correctness requirements
specified in the contract of the query engine.
2. Having ontologies inconsistent with data implies effectiveness problems.
Indeed, in the vast majority of cases, the ontologies containing
knowledge about data are not updated consistently with the changes
in data. At best, these knowledge representations are revised in a
sequence of discrete versions. So, they are not consistent with the
data at an arbitrary point in time. This shortcoming may be over-
come only if ontologies in a knowledge repository evolve continu-
ously in response to data change. Ontology evolution will have a
substantially lower overhead because the volume of changes is
always significantly lower than the volume of the data, though it depends on data velocity (Figure 1.3).

To sum up, relaxing the consequences of the two outlined shortcomings and, hence, balancing efficiency and effectiveness may be achievable if a
bidirectional processing approach is followed. Top-down query answering has to be complemented by a bottom-up ontology evolution process, the two meeting at the knowledge representation layer. In addition to balancing efficiency and effectiveness, such an approach to processing huge data sets
may help us “. . . find and see dynamically changing ontologies without having to try to prescribe them in advance.* Taxonomies and ontologies are things that you might discover by observation, and watch evolve over time” (cf. Bollier 2010).

* Technologies for information and knowledge extraction are also developed and need to be regarded as bottom-up. However, these technologies are designed to work off-line for updating the existing ontologies in a discrete manner. Their execution is not coordinated with the top-down query processing and data changes. So, the shortcomings outlined below persist.
Further, we focus on outlining a complementary bottom-up path in the
overall processing stack which facilitates existing top-down query answering
frameworks by providing knowledge evolution in line with data change—as
pictured in Figure 1.5. In a nutshell, the proposed bottom-up path is charac-
terized by:

• Efficiently performing simple scalable queries on vast volumes of data or in a stream window for extracting facts and decreasing volumes (more details could be found in the “Scaling with a Traditional Database” section)
• Adding extracted facts to a highly expressive persistent knowledge
base allowing evolution of knowledge (more details on that could
be seen in Knowledge Self-Management and Refinement through
Evolution)
• Assessing fitness of knowledge organisms and knowledge represen-
tations in the evolving knowledge ecosystem (our approach to that is
also outlined in the “Knowledge Self-Management and Refinement
through Evolution” section)
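As a minimal illustration of the first step above, the following sketch runs a cheap pattern query inside a tumbling stream window, keeps only the extracted fact triples, and drops the raw tokens. The fact pattern, window size, and input lines are assumptions made for the example.

```python
import re

# Sketch: a cheap, scalable query (a regular-expression match) is run over a
# tumbling stream window; only compact fact triples are kept, while the raw
# data tokens are discarded. Pattern, window size, and inputs are invented.
FACT = re.compile(r"(\w+) (?:hikes|hiked) (?:its )?(\w+) forecast")

def extract_facts(stream, window_size=2):
    facts, window = set(), []
    for token in stream:
        window.append(token)
        if len(window) == window_size:        # window full: query it, then drop it
            for line in window:
                m = FACT.search(line)
                if m:
                    facts.add((m.group(1), "hike", m.group(2)))
            window = []                       # raw tokens forgotten: volume decreases
    for line in window:                       # flush the final partial window
        m = FACT.search(line)
        if m:
            facts.add((m.group(1), "hike", m.group(2)))
    return facts

news = [
    "Boeing hikes its market forecast for China",
    "weather is fine today",
    "Airbus hiked delivery forecast",
]
```

Only the small set of extracted facts travels further up to the knowledge base, which is what decreases the volume.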

This will enable reducing the overheads of the top-down path by performing refined inference using highly expressive and complex queries over evolving (i.e., consistent with data) and linked (i.e., harmonized), but reasonably small fragments of knowledge. Query results may also be materialized for further decreasing computational effort.

Figure 1.5
Refining Big Data semantics layer for balancing efficiency and effectiveness. The top-down path for query answering (Big Data analytics; semantic query planning and answering) is complemented by a bottom-up path for knowledge evolution (extraction, contextualization, change harmonization, and knowledge evolution over a distributed persistent knowledge storage), both resting on the Big Data storage, access, and management layer (query planning and execution, data management, data stream processing, persistent distributed storage).

* Underlined by the authors of this chapter.
After outlining the abstract architecture and the bottom-up approach, we
will now explain at a high level how Big Data needs to be treated along the
way. A condensed formula for this high-level approach is “3F + 3Co” which
is unfolded as

3F: Focusing-Filtering-Forgetting
3Co: Contextualizing-Compressing-Connecting

Notably, both 3F and 3Co are not novel and are used, in part, extensively in many domains and in different interpretations. For example, an interesting
interpretation of 3F is offered by Dean and Webb (2011) who suggest this for-
mula as a “treatment” for senior executives (CEOs) to deal with information
overload and multitasking. Executives are offered to cope with the problem
by focusing (doing one thing at a time), filtering (delegating so that they do
not take on too many tasks or too much information), and forgetting (taking
breaks and clearing their minds).

Focusing
Following our Boeing example, let us imagine a data analyst extracting
knowledge tokens from a business news stream and putting these tokens
as missing bits in the mosaic of his mental picture of the world. A tricky
part of his work, guided by intuition or experience in practice, is choosing
the order in which the facts are picked up from the tokens. The order of focusing
is very important as it influences the formation and saturation of different
fragments in the overall canvas. Even if the same input tokens are given, dif-
ferent curves of focusing may result in different knowledge representations
and analysis outcomes.
A similar aspect of proper focusing is of importance also for automated
processing of Big Data or its semantics. One could speculate whether a pro-
cessing engine should select data tokens or assertions in the order of their
appearance, in a reversed order, or anyhow else. If data or assertions are pro-
cessed in a stream window and in real time, the order of focusing is of lesser
relevance. However, if all the data or knowledge tokens are in a persistent
storage, having some intelligence for optimal focusing may improve process-
ing efficiency substantially. With smart focusing at hand, a useful token can
be found or a hidden pattern extracted much faster and without making a
complete scan of the source data. A complication for smart focusing is that
the nodes on the focusing curve have to be decided upon on-the-fly because
generally the locations of important tokens cannot be known in advance.
Therefore, the processing of a current focal point should not only yield what is directly intended from this portion of data, but also hint at the next point on the curve.
A weak point in such a “problem-solving” approach is that some potentially
valid alternatives are inevitably lost after each choice made on the decision
path. So, only a suboptimal solution is practically achievable. The evolution-
ary approach detailed further in section “Knowledge Self-Management and
Refinement through Evolution” follows, in fact, a similar approach of smart
focusing, but uses a population of autonomous problem-solvers operating
concurrently. Hence, it leaves a much smaller part of a solution space without
attention, reduces the bias of each choice, and likely provides better results.
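Smart focusing as described above resembles a best-first traversal: processing the current focal point returns its result together with scored hints about the next candidate points, so the engine can find useful tokens without a complete scan. The toy data set and the hint scoring below are assumptions made for the sketch.

```python
import heapq

# Sketch of "smart focusing": a best-first walk over data tokens in which
# processing the current focal point returns both a result and scored hints
# about which tokens deserve attention next, so a complete scan is avoided.
# The toy data and the hint scoring are invented for the example.

def process(token, data):
    value = data[token]
    # hint: tokens whose values fall in the same decade look related
    hints = {t: 1.0 if v // 10 == value // 10 else 0.1
             for t, v in data.items() if t != token}
    return value, hints

def focus(data, start, budget):
    """Visit at most `budget` tokens, choosing each focal point by hint score."""
    frontier = [(-1.0, start)]                 # max-heap via negated scores
    seen, results = set(), []
    while frontier and len(results) < budget:
        _, token = heapq.heappop(frontier)
        if token in seen:
            continue
        seen.add(token)
        value, hints = process(token, data)
        results.append((token, value))
        for t, score in hints.items():
            if t not in seen:
                heapq.heappush(frontier, (-score, t))
    return results

data = {"a": 11, "b": 12, "c": 55, "d": 13}
print(focus(data, "a", budget=3))              # [('a', 11), ('b', 12), ('d', 13)]
```

A single such problem-solver shows exactly the weakness discussed above: the low-scored token "c" is never visited; a population of concurrent solvers with different starting points would leave less of the solution space unattended.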

Filtering
A data analyst who receives dozens of news posts at once has to focus on the
most valuable of them and filter out the rest which, according to his informed
guess, do not add anything important to those in his focus.
Moreover, it might also be very helpful to filter out noise, that is, irrelevant
tokens, irrelevant dimensions of data, or those bits of data that are unreadable or otherwise corrupted. In fact, an answer to the question about
what to trash and what to process needs to be sought based on the under-
standing of the objective (e.g., was the reason for Boeing to hike their market
forecast valid?) and the choice of the proper context (e.g., should we look into
the airline fleets or the economic situation in developing countries?).
A reasonable selection of features for processing, or conversely a rational choice of the features that may be filtered out, may essentially reduce the volume as well as the variety/complexity of the data, resulting in higher efficiency balanced with effectiveness.
Quite similar to focusing, a complication here is that for big heterogeneous
data it is not feasible to expect a one-size-fits-all filter in advance. Moreover,
for deciding about an appropriate filtering technique and the structure of a
filter to be applied, a focused prescan of data may be required, which implies
a decrease in efficiency. The major concern is again how to filter in a smart
way and so as to balance the intentions to reduce processing effort (efficiency)
and keep the quality of results within acceptable bounds (effectiveness).
Our evolutionary approach presented in the section “Knowledge Self-
Management and Refinement through Evolution” uses a system of environ-
mental contexts for smart filtering. These contexts are not fixed but may be
adjusted by several independent evolutionary mechanisms. For example, a
context may become more or less “popular” among the knowledge organ-
isms that harvest knowledge tokens in them because these organisms
may migrate freely between contexts in search of better, more appropriate, healthier knowledge to collect. Another useful property we propose
for knowledge organisms is their resistance to sporadic mutagenic factors,
which may be helpful for filtering out noise.

Forgetting
A professional data analyst always keeps a record of data he used in his work
and the knowledge he created in his previous analyses. The storage for all
these gems of expertise is, however, limited, so it has to be cleaned periodi-
cally. Such cleaning implies trashing things that are potentially valuable, though never or very rarely used, and causes doubts and later regrets about what is lost. A similar thing happens when Big Data storage overflows—some parts
of it have to be trashed and so “forgotten.” A question in this respect is about
which part of a potentially useful collection may be sacrificed. Is forgetting
the oldest records reasonable?—perhaps not. Shall we forget the features that
have been previously filtered out?—negative again. There is always a chance
that an unusual task for analysis pops up and requires the features never
exploited before. Are the records with minimal potential utility the best can-
didates for trashing?—could be a rational way to go, but how would their
potential value be assessed?
Practices in Big Data management confirm that forgetting following
straightforward policies like fixed lifetime for keeping records causes regret
almost inevitably. For example, the Climate Research Unit (one of the leading
institutions that study natural and anthropogenic climate change and collect
climate data) admits that they threw away the key data to be used in global
warming calculations (Joseph 2012).
A better policy for forgetting might be to extract as much knowledge as possible from the data before deleting them. It cannot be guaranteed, however,
that future knowledge mining and extraction algorithms will not be capable of
discovering more knowledge to preserve. Another potentially viable approach
could be “forgetting before storing,” that is, there should be a pragmatic reason
to store anything. The approach we suggest in the section “Knowledge Self-
Management and Refinement through Evolution” follows exactly this way.
Though knowledge tokens are extracted from all the incoming data tokens, not
all of them are consumed by knowledge organisms, but only those assertions
that match their knowledge genome to a sufficient extent. This similarity is
considered a good reason for remembering a fact. The rest remains in the envi-
ronment and dies out naturally after its lifetime comes to an end, as explained in
“Knowledge Self-Management and Refinement through Evolution”.
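A minimal sketch of this "forgetting before storing" policy: an assertion is remembered only if it matches a knowledge genome (here, a plain set of terms) above a threshold, while everything else stays in the environment for a finite lifetime and then dies out. The genome, threshold, and lifetime values below are assumptions made for the example.

```python
# Sketch of "forgetting before storing": only assertions sufficiently
# similar to the knowledge genome are remembered; the rest stays in the
# environment and dies out when its lifetime ends.
# Genome, threshold, and lifetime are invented for the example.

GENOME = {"boeing", "forecast", "market", "aircraft"}
THRESHOLD = 0.5
LIFETIME = 3  # logical ticks

def similarity(assertion):
    terms = set(assertion.lower().split())
    return len(terms & GENOME) / len(terms)

class Environment:
    def __init__(self):
        self.memory = []    # remembered assertions
        self.pending = []   # (assertion, expiry_tick) still in the environment
        self.tick = 0

    def offer(self, assertion):
        if similarity(assertion) >= THRESHOLD:
            self.memory.append(assertion)      # sufficient match: remember
        else:
            self.pending.append((assertion, self.tick + LIFETIME))

    def advance(self):
        self.tick += 1
        # assertions whose lifetime has come to an end die out naturally
        self.pending = [(a, t) for a, t in self.pending if t > self.tick]

env = Environment()
env.offer("Boeing hiked its market forecast")  # 3 of 5 terms in the genome
env.offer("the weather is fine today")         # no overlap: left to die out
```

Nothing below the threshold is ever stored, so the pragmatic reason for remembering a fact is built into the storage decision rather than into a later cleanup.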

Contextualizing
Our reflection of the world is often polysemic, so a pragmatic choice of a
context is often needed for proper understanding. For example, “taking a
mountain hike” or “hiking a market forecast” are different actions though
the same lexical root is used in the words. An indication of the context (recreation or business in this example) would be necessary for making the statement explicit. To put it more broadly, not only the sense of statements, but
also judgments, assessments, attitudes, and sentiments about the same data
or knowledge token may well differ in different contexts. When it goes about
data, it might be useful to know:

1. The “context of origin”—the information about the source; who organized and performed the action; what were the objects; what features
have been measured; what were the reasons or motives for collecting
these data (transparent or hidden); when and where the data were
collected; who were the owners; what were the license, price, etc.
2. The “context of processing”—formats, encryption keys, used prepro-
cessing tools, predicted performance of various data mining algo-
rithms, etc.; and
3. The “context of use”—potential domains, potential or known appli-
cations, which may use the data or the knowledge extracted from it,
potential customers, markets, etc.

Having circumscribed these three different facets of context, we may say now that data contextualization is a transformation process which decontex-
tualizes the data from the context of origin and recontextualizes it into the
context of use (Thomason 1998), if the latter is known. This transformation is
performed via smart management of the context of processing.
Known data mining methods are capable of automatically separating
the so-called “predictive” and “contextual” features of data instances (e.g.,
Terziyan 2007). A predictive feature stands for a feature that directly influ-
ences the result of applying to data a knowledge extraction instrument—
knowledge discovery, prediction, classification, diagnostics, recognition, etc.

RESULT = INSTRUMENT(Predictive Features).

Contextual features could be regarded as arguments to a meta-function that influences the choice of an appropriate (based on predicted quality/performance) instrument to be applied to a particular fragment of data:

INSTRUMENT = CONTEXTUALIZATION(Contextual Features).

Hence, a correct way to process each data token and benefit from contextualization would be: (i) decide, based on contextual features, which would
be an appropriate instrument to process the token; and then (ii) process it
using the chosen instrument that takes the predictive features as an input.
This approach to contextualization is not novel and is known in data mining
and knowledge discovery as “dynamic” integration, classification, selec-
tion, etc. Puuronen et  al. (1999) and Terziyan (2001) proved that the use of
dynamic contextualization in knowledge discovery yields essential quality
improvement compared to “static” approaches.
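As a minimal illustration, the two-step procedure can be sketched in Python; the instruments, the registry keys, and the `outliers` flag are purely hypothetical stand-ins for real data mining components:

```python
# Sketch of dynamic contextualization: contextual features select the
# instrument; predictive features are fed to the chosen instrument.
# The registry and its keys are illustrative assumptions.

def mean_estimator(values):
    """A simple 'instrument', suitable for smooth numeric data."""
    return sum(values) / len(values)

def median_estimator(values):
    """A more robust 'instrument', suitable for noisy numeric data."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

INSTRUMENTS = {"smooth": mean_estimator, "noisy": median_estimator}

def contextualization(contextual_features):
    """Meta-function: pick the instrument expected to perform best."""
    key = "noisy" if contextual_features.get("outliers", False) else "smooth"
    return INSTRUMENTS[key]

def process_token(contextual_features, predictive_features):
    # Step (i): choose the instrument based on the contextual features.
    instrument = contextualization(contextual_features)
    # Step (ii): apply it to the predictive features.
    return instrument(predictive_features)
```

Here `process_token({"outliers": True}, [1, 2, 100])` selects the median estimator, while the same predictive features in an outlier-free context would be averaged.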
Toward Evolving Knowledge Ecosystems for Big Data Understanding 29

Compressing
In the context of Big Data, having data in a compact form is very important for
saving storage space or reducing communication overheads. Compressing is
a process of data transformation toward making data more compact in terms
of required storage space, but still preserving either fully (lossless compres-
sion) or partly (lossy compression) the essential features of these data—those
potentially required for further processing or use.
Compression, in general, and Big Data compression, in particular, are
effectively possible due to a high probability of the presence of repetitive,
periodical, or quasi-periodical data fractions or visible trends within data.
Similar to contextualization, it is reasonable to select an appropriate data
compression technique individually for different data fragments (clusters),
also in a dynamic manner and using contextualization. Lossy compression
may be applied if it is known how data will be used, at least potentially.
Thus, some data fractions may be sacrificed without losing the facets of semantics and the overall quality of the data required for its known ways of use.
A relevant example of a lossy compression technique for data having quasi-
periodical features and based on a kind of “meta-statistics” was reported by
Terziyan et al. (1998).
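A toy sketch of such dynamic, per-fragment selection might look as follows; the context flags (`tolerates_loss`, `quasi_periodic`, `keep_every`) are illustrative assumptions, and real lossy codecs for quasi-periodic data are far more sophisticated than plain downsampling:

```python
import zlib

def compress_fragment(samples, context):
    """Choose a compression technique per data fragment based on its
    context. Lossy downsampling is applied only when the context says
    the known uses tolerate it and the fragment is quasi-periodic."""
    if context.get("tolerates_loss") and context.get("quasi_periodic"):
        k = context.get("keep_every", 4)
        return ("lossy", samples[::k])       # keep every k-th sample
    raw = ",".join(str(s) for s in samples).encode("utf-8")
    return ("lossless", zlib.compress(raw))  # fully recoverable

def decompress_lossless(blob):
    """Invert the lossless branch exactly."""
    return [int(x) for x in zlib.decompress(blob).decode("utf-8").split(",")]
```

The lossless branch exploits exactly the repetitiveness mentioned above (deflate finds repeated substrings), while the lossy branch trades recoverable detail for space only when the context of use permits it.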

Connecting
It is known that nutrition is healthy and balanced if it provides all the neces-
sary components that are further used as building blocks in a human body.
These components become parts of a body and are tightly connected to the
rest of it. Big Data could evidently be regarded as nutrition for knowledge
economy as discussed in “Motivation and Unsolved Issues”. A challenge is
to make this nutrition healthy and balanced for building an adequate mental
representation of the world, which is Big Data understanding. Following the
allusion of human body morphogenesis, understanding could be simplisti-
cally interpreted as connecting or linking new portions of data to the data
that is already stored and understood. This immediately brings us to the concept of linked data (Bizer et al. 2009), where “linked” is interpreted as a
sublimate of “understood.” We have written “a sublimate” because having
data linked is not yet sufficient, though necessary, for the further, more intelligent phase of building knowledge out of data. After data have been linked,
data and knowledge mining, knowledge discovery, pattern recognition,
diagnostics, prediction, etc. could be done more effectively and efficiently.
For example, Terziyan and Kaykova (2012) demonstrated that executing busi-
ness intelligence services on top of linked data is noticeably more efficient
than without using linked data. Consequently, knowledge generated out of
linked data could also be linked using the same approach, resulting in the
linked knowledge. It is clear from the Linking Open Data Cloud Diagram
by Richard Cyganiak and Anja Jentzsch (lod-cloud.net) that knowledge

(e.g., RDF or OWL modules) represented as linked data can be relatively easily linked to different public data sets, which creates a cloud of linked
open semantic data.
Mitchell and Wilson (2012) argue that the key to extracting value from Big Data lies in exploiting the concept of linking. They believe that linked data
potentially creates ample opportunities from numerous data sources. For
example, using links between data as a “broker” brings more possibilities
of extracting new data from the old, creating insights that were previously
unachievable, and facilitating exciting new scenarios for data processing.
For developing an appropriate connection technology, relevant results come from numerous research and development efforts, for example, Linking
Open Data (LOD; w3.org/wiki/SweoIG/TaskForces/CommunityProjects/
LinkingOpenData) project, DBpedia (dbpedia.org), OpenCyc (opencyc.org),
FOAF (foaf-project.org), CKAN (ckan.org), Freebase (freebase.com), Factual
(factual.com), and INSEMTIVES (insemtives.eu/index.php). These projects
create structured and interlinked semantic content, in fact, mashing up the
features from Social and Semantic Web (Ankolekar et al. 2007). One strength
of their approach is that the collaborative content development effort is propagated up the data-processing stack, which allows creating semantic representations collaboratively and in an evolutionary manner.
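As a minimal sketch of what “connecting” amounts to, the Boeing knowledge token from the running example can be linked to already-stored triples wherever it mentions a known resource; plain (subject, predicate, object) tuples stand in here for a real RDF store:

```python
# A new knowledge token (a few RDF-like triples) is linked to the stored
# graph at every resource the graph already knows about. The triples
# mirror the chapter's Boeing example; the tuple representation is an
# illustrative simplification.

stored = {
    ("Boeing", "a", "PlaneMaker"),
    ("Boeing", "basedIn", "UnitedStates"),
}

token = {
    ("Boeing", "hikes", "New20YMarketForecast"),
    ("New20YMarketForecast", "a", "MarketForecast"),
}

def link(graph, new_token):
    """Add the token's triples; report which known resources anchor it."""
    known = {s for (s, _, _) in graph} | {o for (_, _, o) in graph}
    anchors = {r for (s, _, o) in new_token for r in (s, o) if r in known}
    return graph | new_token, anchors

graph, anchors = link(stored, token)
```

The anchor set (here, the already-known resource `Boeing`) is what makes later mining over the merged graph cheaper: new assertions arrive pre-attached to existing nodes instead of floating free.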

Autonomic Big Data Computing


The treatment offered in the “Refining Big Data Semantics Layer for
Balancing Efficiency-Effectiveness” section requires a paradigm shift in Big
Data computing. In seeking for a suitable approach to building processing
infrastructures, a look into Autonomic Computing might be helpful. Started
by International Business Machines (IBM) in 2001, Autonomic Computing
refers to the characteristics of complex computing systems allowing them
to manage themselves without direct human intervention. A human, in
fact, defines only general policies that constrain self-management process.
According to IBM,* the four major functional areas of autonomic computing
are: (i) self-configuration—automatic configuration of system components; (ii)
self-optimization—automatic monitoring and ensuring the optimal function-
ing of the system within defined requirements; (iii) self-protection—automatic
identification and protection from security threats; and (iv) self-healing—
automatic fault discovery and correction. Other important capabilities of
autonomic systems are: self-identity in a sense of being capable of knowing
itself, its parts, and resources; situatedness and self-adaptation—sensing the
influences from its environment and acting accordingly to what happens
in the observed environment and a particular context; being non-proprietary
in a sense of not constraining itself to a closed world but being capable of
functioning in a heterogeneous world of open standards; and anticipatory in

* research.ibm.com/autonomic/overview/elements.html (accessed October 10, 2012).



a sense of being able to automatically anticipate needed resources and seamlessly bridge user tasks to their technological implementations, hiding
complexity.
However, having an autonomic system for processing Big Data semantics might not be sufficient. Indeed, even such a sophisticated system may at some point face circumstances to which it would not be capable of reacting by reconfiguration. The design objectives will then not be met by such a system, and it should qualify itself as not useful for further exploitation and die.
A next-generation software system will then be designed and implemented
(by humans) which may inherit some valid features from the ancestor system
but shall also have some principally new features. Therefore, it needs to be
admitted that it is not always possible for even an autonomic system to adapt
itself to a change within its lifetime. Consequently, self-management capabil-
ity may not be sufficient for the system to survive autonomously—humans are required for giving birth to successors. Hence, we are coming to the neces-
sity of a self-improvement feature which is very close to evolution. In that we
may seek for inspiration in bio-social systems. Nature offers an automatic
tool for adapting biological species across generations named genetic evolu-
tion. An evolutionary process could be denoted as the process of proactive
change of the features in the populations of (natural or artificial) life forms
over successive generations providing diversity at every level of life organiza-
tion. Darwin (1859) put the following principles in the core of his theory:

• Principle of variation (variations of configuration and behavioral features);
• Principle of heredity (a child inherits some features from its parents);
• Principle of natural selection (some features make some individuals more competitive than others in obtaining the resources needed for survival).
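These three principles can be illustrated with a toy generational loop; the numeric “genomes,” the fitness function, and all parameters are arbitrary assumptions for illustration only:

```python
import random

def evolve(population, fitness, generations=50, seed=1):
    """Toy evolutionary loop over numeric 'genomes': selection keeps the
    fitter half (natural selection), children copy a surviving parent
    (heredity) with a small random perturbation (variation)."""
    rng = random.Random(seed)
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        children = [parent + rng.gauss(0, 0.1) for parent in survivors]
        population = survivors + children
    return max(population, key=fitness)

# An assumed fitness that peaks at 3.0; starting from 0.0, selection
# should drive the population toward the peak.
best = evolve([0.0] * 8, fitness=lambda x: -(x - 3.0) ** 2)
```

Because survivors are carried over unchanged, the best fitness never decreases across generations, which is the minimal property any such loop should preserve.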

These principles may remain valid for evolving software systems, in par-
ticular, for Big Data computing. Processing knowledge originating from Big
Data may, however, imply more complexity due to its intrinsic social features.
Knowledge is a product that needs to be shared within a group so that
survivability and quality of life of the group members will be higher than
those of any individual alone. Sharing knowledge facilitates collaboration
and improves individual and group performance. Knowledge is actively
consumed and also left as a major inheritance for future generations, for
example, in the form of ontologies. As a collaborative and social substance,
knowledge and cognition evolve in a more complex way for which additional
facets have to be taken into account such as social or group focus of attention,
bias, interpretation, explicitation, expressiveness, inconsistency, etc.
In summary, it may be admitted that Big Data is collected and super-
vised by different communities following different cultures, standards,

objectives, etc. Big Data semantics is processed using naturally different ontologies. All these loosely coupled data and knowledge fractions in fact
“live their own lives” based on very complex processes, that is, evolve follow-
ing the evolution of these cultures, their cognition mechanisms, standards,
objectives, ontologies, etc. An infrastructure for managing and understand-
ing such data straightforwardly needs to be regarded as an ecosystem of
evolving processing entities. Below we propose treating ontologies (a key
for understanding Big Data) as genomes and bodies of those knowledge
processing entities. For this, basic principles by Darwin are applied to their
evolution aiming to get optimal or quasi-optimal (according to evolving defi-
nition of the quality) populations of knowledge species. These populations
represent the evolving understanding of the respective islands of Big Data
in their dynamics. This approach to knowledge evolution will require inter-
pretation and implementation of concepts like “birth,” “death,” “morpho-
genesis,” “mutation,” “reproduction,” etc., applied to knowledge organisms,
their groups, and environments.

Scaling with a Traditional Database


In some sense, “Big data” is a term that is increasingly being used to describe
very large volumes of unstructured and structured content—usually in
amounts measured in terabytes or petabytes—that enterprises want to har-
ness and analyze.
Traditional relational database management technologies, which use index-
ing for speedy data retrieval and complex query support, have been hard
pressed to keep up with the data insertion speeds required for big data ana-
lytics. Once a database gets bigger than about half a terabyte, some database
products’ ability to rapidly accept new data starts to decrease.
There are two kinds of scalability, namely vertical and horizontal. Vertical
scaling is just adding more capacity to a single machine. Fundamentally,
every database product is vertically scalable to the extent that it can make
good use of more central processing unit cores, random access memory, and
disk space. With a horizontally scalable system, it is possible to add capacity
by adding more machines. Beyond doubt, most database products are not
horizontally scalable.
When an application needs more write capacity than it can get out of a single machine, it is required to shard (partition) its data across mul-
tiple database servers. This is how companies like Facebook (facebook.com)
or Twitter (twitter.com) have scaled their MySQL installations to massive
proportions. This is the closest to what one can get into horizontal scalability
with database products.

Sharding is a client-side affair, that is, the database server does not do it for the user. In this kind of environment, when someone accesses data, the data access layer uses consistent hashing to determine which machine in the cluster a particular piece of data should be written to (or read from). Adding capacity to a sharded system is a process of manually rebalancing the data across the cluster. In a truly horizontally scalable system, by contrast, the database system itself takes care of rebalancing the data and guaranteeing that it is adequately replicated across the cluster. This is what it means for a database to be horizontally scalable.
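A minimal sketch of the consistent hashing used by such a client-side data access layer follows; the virtual-node count, the MD5 hash, and the shard names are illustrative assumptions:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each shard owns many points on a
    circle; a key belongs to the first shard point clockwise from the
    key's own hash."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Shard that should store (or serve) the given key."""
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]
```

Adding a fourth shard moves only the keys whose hashes fall into the new shard's arcs of the ring; with naive `hash(key) % n` sharding, nearly every key would move, which is why consistent hashing keeps manual rebalancing tractable.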
In many cases, constructing Big Data systems on premises provides better
data flow performance, but requires a greater capital investment. Moreover,
one has to consider the growth of the data. While many model linear growth
curves, interestingly the patterns of data growth within Big Data systems
are more exponential. Therefore, one should model both technology and costs to match the expected growth of the database and of the data flows.
Structured data transformation is the traditional approach of changing the
structure of the data found within the source system to the structure of the
target system, for instance, a Big Data system. The advantage of most Big Data
systems is that deep structure is not a requirement; without doubt, structure
can typically be layered in after the data arrive at the target. However, it is a best practice to form the data within the target. It should be a good abstrac-
tion of the source operational databases in a structure that allows those who
analyze the data within the Big Data system to effectively and efficiently find
the data required. The issue to consider with scaling is the amount of latency
that transformations cause as data move from the source(s) to the target, and
the data are changed in both structure and content. However, one should
avoid complex transformations as data migrate from operational sources to the analytical targets. Once the data are contained within a Big Data sys-
tem, the distributed nature of the architecture allows for the gathering of the
proper result set. So, transformations that cause less latency are more suit-
able within Big Data domain.

Large Scale Data Processing Workflows


The overall infrastructure for many Internet companies can be represented as
a pipeline with three layers: Ingestion, Storage & Processing, and Serving.
The most vital among the three is the Storage & Processing layer. This layer
can be represented as a stack of several sublayers, with a scalable file system such
as Google File System (Ghemawat et  al. 2003) at the bottom, a framework
for distributed sorting and hashing, for example, Map-Reduce (Dean and
Ghemawat 2008) over the file system layer, a dataflow programming frame-
work over the map-reduce layer, and a workflow manager at the top.
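The map-reduce sublayer of this stack can be illustrated with the canonical word count, with the framework's distributed shuffle/sort step simulated here in a single process:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for each word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle/sort: group mapper output by key, as the framework would
    # do across machines via its distributed sort.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data", "big knowledge"])
```

Because mappers see documents independently and reducers see one key at a time, both phases parallelize across machines; the grouping in the middle is the only step that requires global coordination.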
Debugging large-scale data, in the Internet firms, is crucial because
data passes through many subsystems, each having a different query interface, a different metadata representation, different underlying models (some
have files, some have records, some have workflows), etc. Thus, it is hard

to maintain consistency and it is essential to factor out the debugging from the subsystems. There should be a self-governing system that takes care of
all the metadata management. All data-processing subsystems can dispatch their metadata to such a system, which absorbs all the metadata, integrates
them, and exposes a query interface for all metadata queries. This can provide a uniform view to users, factor out the metadata management code, and decouple metadata lifetime from data/subsystem lifetime.
Another stimulating problem is to deal with different data and process
granularity. Data granularity can vary from a web page, to a table, to a row,
to a cell. Process granularity can vary from a workflow, to a map-reduce pro-
gram, to a map-reduce task. It is very hard to make an inference when the given relationship is in one granularity and the query is in another, and therefore it is vital to capture provenance data across the workflow.
While there is no one-size-fits-all solution, a good methodology could be
to use the best granularity at all levels. However, this may cause a lot of
overhead and thus some smart domain-specific techniques need to be imple-
mented (Lin and Dyer 2010; Olston 2012).

Knowledge Self-Management and Refinement through Evolution
The world changes—so do the beliefs and reflections about it. Those beliefs
and reflections are the knowledge humans have about their environments.
However, the nature of those changes is different. The world just changes in
events. Observation or sensing (Ermolayev et al. 2008) of events invokes gen-
eration of data—often in huge volumes and with high velocities. Humans
evolve—adapt themselves to become better fitted to the habitat.
Knowledge is definitely a product of some processes carried out by con-
scious living beings (for example, humans). Following Darwin’s (1859)
approach and terminology to some extent, it may be stated that knowledge,
both in terms of scope and quality, makes some individuals more competi-
tive than others in getting vital resources or at least for improving their qual-
ity of life. The major role of knowledge as a required feature for survival is
decision-making support. Humans differ in fortune and fate because they
make different choices in similar situations, which is largely due to their pos-
session of different knowledge. So, the evolution of conscious beings notice-
ably depends on the knowledge they possess. On the other hand, making a
choice in turn triggers the production of knowledge by a human. Therefore,
it is natural to assume that knowledge evolves triggered by the evolution of
conscious beings, their decision-making needs and taken decisions, quality
standards, etc. To put both halves in one whole, knowledge evolves to support the proactive needs of its owners more effectively, for example, to better interpret or explain the data generated when observing events, corresponding to the diversity and complexity of these data. This
observation leads us to a hypothesis about the way knowledge evolves:

The mechanisms of knowledge evolution are very similar to the mechanisms of biological evolution. Hence, the methods and mechanisms for
the evolution of knowledge could be spotted from the ones enabling the
evolution of living beings.

In particular, investigating the analogies and developing the mechanisms for the evolution of formal knowledge representations—specified as ontolo-
gies—is of interest for the Big Data semantics layer (Figure 1.5). The triggers
for ontology evolution in the networked and interlinked environments could
be external influences coming bottom-up from external and heterogeneous
information streams.
Recently, the role of ontologies as formal and consensual knowledge rep-
resentations has become established in different domains where the use
of knowledge representations and reasoning is an essential requirement.
Examples of these domains range from heterogeneous sensor network data
processing through the Web of Things to Linked Open Data management
and use. In all these domains, distributed information artifacts change spo-
radically and intensively in reflection of the changes in the world. However,
the descriptions of the knowledge about these artifacts do not evolve in line
with these changes.
Typically, ontologies are changed semiautomatically or even manually and
are available in a sequence of discrete revisions. This fact points out a seri-
ous disadvantage of ontologies built using state-of-the-art knowledge engi-
neering and management frameworks and methodologies: expanding and
amplified distortion between the world and its reflection in knowledge. It is
also one of the major obstacles for a wider acceptance of semantic technolo-
gies in industries (see also Hepp 2007; Tatarintseva et al. 2011).
The diversity of domain ontologies is an additional complication for proper
and efficient use of dynamically changing knowledge and information arti-
facts for processing Big Data semantics. Currently, the selection of the best
suiting one for a given set of requirements is carried out by a knowledge
engineer using his/her subjective preferences. A more natural evolutionary
approach for selecting the best-fitting knowledge representations promises
enhancing robustness and transparency, and seems to be more technologi-
cally attractive.
Further, we elaborate a vision of a knowledge evolution ecosystem where
agent-based software entities carry their knowledge genomes in the form of
ontology schemas and evolve in response to the influences perceived from
their environments. These influences are thought of as the tokens of Big Data
(like news tokens in the “Illustrative Example” section) coming into the spe-
cies’ environments. Evolution implies natural changes in the ontologies which

reflect the change in the world snap-shotted by Big Data tokens. Inspiration
and analogies are taken from evolutionary biology.

Knowledge Organisms, their Environments, and Features


Evolving software entities are further referred to as individual Knowledge
Organisms (KO). It is envisioned (Figure 1.6) that a KO:

1. Is situated in its environment as described in “Environment, Perception (Nutrition), and Mutagens”
2. Carries its individual knowledge genome represented as a schema or
Terminological Box (TBox; Nardi and Brachman 2007) of the respec-
tive ontology (see “Knowledge Genome and Knowledge Body”)
3. Has its individual knowledge body represented as an assertional com-
ponent (ABox; Nardi and Brachman 2007) of the respective ontology
(see “Knowledge Genome and Knowledge Body”)
4. Is capable of perceiving the influences from the environment in the
form of knowledge tokens (see “Environment, Perception (Nutrition),
and Mutagens”) that may cause the changes in the genome (see
“Mutation”) and body (see “Morphogenesis”)—the mutagens
5. Is capable of deliberating about the affected parts of its genome and
body (see “Morphogenesis” and “Mutation”)

[Figure 1.6 diagram: a KO comprising a genome (TBox) and a body (ABox), with perception, sensor input, deliberation, communication, morphogenesis, mutation, recombination, reproduction, excretion, and action output, situated among other KOs and mutagens in its environment.]
Figure 1.6
A Knowledge Organism: functionality and environment. Small triangles of different trans-
parency represent knowledge tokens in the environment—consumed and produced by KOs.
These knowledge tokens may also be referred to as mutagens as they may trigger mutations.

6. Is capable of consuming some parts of a mutagen for: (a) morphogenesis changing only the body (see “Morphogenesis”); (b) mutation
changing both the genome and body (see “Mutation”); or (c) recom-
bination—a mutual enrichment of several genomes in a group of KO
which may trigger reproduction—recombination of body replicas
giving “birth” to a new KO (see “Recombination and Reproduction”)
7. Is capable of excreting the unused parts of mutagens or the “dead”
parts of the body to the environment
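The perception, morphogenesis, and excretion capabilities in the list above could be sketched, very schematically, as a small class; the assertion format, the set-based matching rule, and the threshold are illustrative assumptions, not the chapter's formal model:

```python
class KnowledgeOrganism:
    """Minimal KO sketch: a TBox 'genome' (set of schema terms) and an
    ABox 'body' (set of consumed assertions). An assertion is modeled
    as a (schema_term, value) pair, which is an assumed simplification."""

    def __init__(self, genome):
        self.genome = set(genome)  # TBox: schema terms
        self.body = set()          # ABox: absorbed assertions

    def match(self, token):
        """Share of a token's assertions whose term is in the genome."""
        hits = [a for a in token if a[0] in self.genome]
        return len(hits) / len(token) if token else 0.0

    def consume(self, token, threshold=0.5):
        """Perceive a knowledge token: if it matches the genome well
        enough, absorb the matching assertions into the body
        (morphogenesis) and excrete the rest; otherwise excrete all."""
        if self.match(token) < threshold:
            return set(token)                # not healthy food
        absorbed = {a for a in token if a[0] in self.genome}
        self.body |= absorbed                # body grows
        return set(token) - absorbed         # unused parts returned
```

Mutation and recombination would additionally alter `self.genome`; here only the body-changing path is shown.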

The results of mainstream research in distributed artificial intelligence and semantic technologies suggest the following basic building blocks for devel-
oping a KO. The features of situatedness (Jennings 2000) and deliberation
(Wooldridge and Jennings 1995) are characteristic to intelligent software agents,
while the rest of the required functionality could be developed using the achiev-
ments in Ontology Alignment (Euzenat and Shvaiko 2007). Recombination
involving a group of KOs could be thought of based on the known mechanisms
for multiissue negotiations on semantic contexts (e.g., Ermolayev et  al. 2005)
among software agents—the members of a reproduction group.

Environment, Perception (Nutrition), and Mutagens


An environmental context for a KO could be thought of as an area of its habitat. Such a context needs to be able to provide nutrition that is “healthy” for
particular KO species, that is, matching their genome noticeably. The food
for nutrition is provided by Knowledge Extraction and Contextualization
functionality (Figure 1.7) in the form of knowledge tokens. Hence, several

Airline Business

Plane Maker Business


Another Domain
Environmental
Contexts
Knowledge token

has
basedIn Boeing: PlaneMaker

by
hikes
New20YMarketForecast

baseOf hikedBy

UnitedStates Old20YMarketForecast : MarketForecast

Another News Stream


Posted: Tue, 03 Jul 2012 05:01:10 -0400
Information token

LONDON (Reuters)
U.S. planemaker Boeing hiked its 20-year
market forecast, predicting demand for
34,000 new aircraft worth $4.5 trillion, on Business News Stream
growth in emerging regions and as airlines
seek efficient new planes to counter high
fuel costs.

Knowledge Extraction
and Contextualization

Figure 1.7
Environmental contexts, knowledge tokens, knowledge extraction, and contextualization.

and possibly overlapping environmental contexts need to be regarded in a hierarchy which corresponds to several subject domains of interest and a foundational knowledge layer. By saying this, we assume that there is a single domain or foundational ontology module schema per environmental context. Different environmental contexts corresponding to different subject domains of interest are pictured as ellipses in Figure 1.7.
Environmental contexts are sowed with knowledge tokens that corre-
spond to their subject domains. It might be useful to limit the lifetime of a
knowledge token in an environment—those which are not consumed dis-
solve finally when their lifetime ends. Fresh and older knowledge tokens are
pictured with different transparency in Figure 1.7.
KOs inhabit one or several overlapping environmental contexts based
on the nutritional healthiness of knowledge tokens sowed there, that is,
the degree to which these knowledge tokens match to the genome of a
particular KO. KOs use their perceptive ability to find and consume
knowledge tokens for nutrition. A KO may decide to migrate from one
environment to another based on the availability of healthy food there.
Knowledge tokens that only partially match KOs’ genome may cause both
KO body and genome changes and are thought of as mutagens. Mutagens,
in fact, deliver the information about the changes in the world to the envi-
ronments of KOs.
Knowledge tokens are extracted from the information tokens either in
a stream window or from the updates of the persistent data storage and
further sown in the appropriate environmental context. The context for placing a newly arriving knowledge token is chosen by the contextualization functional-
ity (Figure 1.7) based on the match ratio to the ontology schema character-
izing the context in the environment. Those knowledge tokens that are not
mapped well to any of the ontology schemas are sown in the environment
without attributing them to any particular context.
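The contextualization functionality described here can be sketched as a match-ratio computation; the set-overlap measure, the threshold, and the context names are illustrative assumptions:

```python
def match_ratio(token_terms, schema_terms):
    """Share of the token's terms that appear in a context's schema."""
    if not token_terms:
        return 0.0
    return len(token_terms & schema_terms) / len(token_terms)

def place_token(token_terms, contexts, threshold=0.5):
    """Sow the token into the best matching environmental context, or
    leave it unattributed (None) if no schema matches well enough."""
    best, best_score = None, 0.0
    for name, schema in contexts.items():
        score = match_ratio(token_terms, schema)
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= threshold else None

# Hypothetical context schemas, echoing the Boeing example.
contexts = {
    "plane-maker business": {"PlaneMaker", "MarketForecast", "hikes"},
    "airline business": {"Airline", "Route", "FuelCost"},
}
token = {"PlaneMaker", "hikes", "MarketForecast", "Boeing"}
placed = place_token(token, contexts)
```

A token such as the Boeing one matches the plane-maker schema on three of four terms and is placed there; a token matching no schema well stays unattributed, exactly the fallback behavior described above.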
For this, existing shallow knowledge extraction techniques could be
exploited, for example, Fan et  al. (2012a). The choice of appropriate tech-
niques depends on the nature and modality of data. Such a technique
would extract several interrelated assertions from an information token
and provide these as a knowledge token coded in a knowledge representa-
tion language of an appropriate expressiveness, for example, in a tractable
subset of the Web Ontology Language (OWL) 2.0 (W3C 2009). Information
and knowledge* tokens for the news item of our Boeing example are pic-
tured in Figure 1.7.

* Unified Modeling Language (UML) notation is used for picturing the knowledge token in
Figure 1.7 because it is more illustrative. Though not shown in Figure 1.7, it can be straight-
forwardly coded in OWL, following, for example, Kendall et al. (2009).

Knowledge Genome and Knowledge Body


Two important aspects in contextualized knowledge representation for an
outlined knowledge evolution ecosystem have to be considered with care
(Figure 1.8):

• A knowledge genome etalon for a population of KOs belonging to one species
• An individual knowledge genome and body for a particular KO

A knowledge genome etalon may be regarded as the schema (TBox) of a distinct ontology module which represents an outstanding context in a
subject domain. In our proposal, the etalon genome is carried by a dedicated
Etalon KO (EKO; Figure 1.8) to enable alignments with individual genomes
and other etalons in a uniform way. The individual assertions (ABox) of this
ontology module are spread over the individual KOs belonging to the cor-
responding species—forming their individual bodies.
The individual genomes of those KOs are the recombined genomes of the
KOs who gave birth to this particular KO. At the beginning of times, the
individual genomes may be replicas of the etalon genome. Anyhow, they evolve independently through mutations, through morphogenesis of an individual KO, or through recombinations in reproductive groups.

[Figure 1.8 diagram: an EKO carries the etalon genome of a species (containing concept C1); individual KOs KOa and KOb each carry their own genome and body and are situated in environmental contexts. C1 is well supported by assertions in the body of KOa but unsupported (∅) in the body of KOb.]
Figure 1.8
Knowledge genomes and bodies. Different groups of assertions in a KO body are attributed to
different elements of its genome, as shown by dashed arrows. The more assertions relate to a
genome element, the more dominant this element is as shown by shades of gray.

Different elements (concepts, properties, axioms) in a knowledge genome may possess different strengths, that is, be dominant or recessive. For exam-
ple (Figure 1.8) concept C1 in the genome of KOa is quite strong because it
is reinforced by a significant number of individual assertions attributed to
this concept, that is, dominant. On the contrary, C1 in the genome of KOb is
very weak—that is, recessive—as it is not supported by individual asser-
tions in the body of KOb. Recessivness or dominance values may be set
and altered using techniques like spreading activation (Quillian 1967, 1969;
Collins and Loftus 1975) which also appropriately affect the structural con-
texts (Ermolayev et al. 2005, 2010) of the elements in focus.
Recessive elements may be kept in the genome as parts of the genetic mem-
ory, but until they do not contradict any dominant elements. For example, if
a dominant property of the PlaneMaker concept in a particular period of time
is PlaneMaker–hikes–MarketForecast, then a recessive property PlaneMaker–
lessens–MarketForecast may die out soon with high probability, as contradic-
tory to the corresponding dominant property.
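In the simplest reading, the dominance bookkeeping pictured in Figure 1.8 reduces to counting the body assertions attributed to each genome element. The following Python sketch illustrates this under that assumption; the function names, the example data, and the threshold are illustrative, and the chapter points to spreading activation for setting such values in earnest:

```python
from collections import Counter

def element_strength(body_assertions):
    """Count how many individual assertions in a KO body support each
    genome element: ("Boeing", "PlaneMaker") supports PlaneMaker."""
    return Counter(concept for _individual, concept in body_assertions)

def classify(genome, strengths, dominance_threshold=3):
    """Split genome elements into dominant and recessive ones by the
    number of supporting assertions (the threshold is illustrative)."""
    dominant = {e for e in genome if strengths[e] >= dominance_threshold}
    return dominant, genome - dominant

genome = {"PlaneMaker", "MarketForecast", "Airline"}
body = [("Boeing", "PlaneMaker"), ("AirBus", "PlaneMaker"),
        ("Embraer", "PlaneMaker"),
        ("New20YMarketForecastbyBoeing", "MarketForecast")]
dominant, recessive = classify(genome, element_strength(body))
# PlaneMaker is reinforced by three assertions and stays dominant;
# MarketForecast and Airline are weakly supported, hence recessive.
```

The same counters can feed the genetic-memory rule: a recessive element is kept until it contradicts a dominant one.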
The etalon genome of a species evolves in line with the evolution of the individual genomes. The difference, however, is that an EKO has no direct relationship (situatedness) to any environmental context. So, all evolution influences are conveyed to the EKO by the individual KOs belonging to the corresponding species via communication. If an EKO and KOs are implemented as agent-based software entities, techniques like agent-based ontology alignment are relevant for evolving etalon genomes. In particular, the alignment settings are similar to a Structural Dynamic Uni-directional Distributed (SDUD) ontology alignment problem (Ermolayev and Davidovsky 2012). The problem could be solved using multi-issue negotiations on semantic contexts, for example, following the approach of Ermolayev et al. (2005) and Davidovsky et al. (2012). For assuring consistency in the updated ontology modules after alignment, several approaches are applicable: incremental updates for atomic decompositions of ontology modules (Klinov et al. 2012); checking the correctness of ontology contexts using the ontology design patterns approach (Gangemi and Presutti 2009); evaluating formal correctness using formal (meta-) properties (Guarino and Welty 2001).
An interesting case would be if an individual genome of a particular KO evolves very differently from the rest of the KOs in the species. This may happen if such a KO is situated in an environmental context substantially different from the context where the majority of the KOs of this species are collecting knowledge tokens. For example, the dominance and recessiveness values in the genome of KOb (Figure 1.8) differ noticeably from those of the genomes of the KOs similar to KOa. A good reason for this may be that KOb is situated in an environmental context different from the context of KOa, so the knowledge tokens KOb may consume differ from the food collected by KOa. Hence, the changes to the individual genome of KOb will be noticeably different from those of KOa after some period of time. Such a genetic drift may cause the structural difference in individual genomes to go beyond the threshold within which recombination gives ontologically viable posterity. A new knowledge genome etalon may, therefore, emerge if the group of KOs with genomes drifted in a similar direction reaches a critical mass, giving birth to a new species.
The following features are required to extend an ontology representation language to cope with the evolutionary mechanisms mentioned above:

• A temporal extension that allows representing and reasoning about the lifetime and temporal intervals of validity of the elements in knowledge genomes and bodies. One relevant extension and reasoning technique is OWL-MET (Keberle 2009).
• An extension that allows assigning meta-properties to ontological elements for verifying formal correctness or adherence to relevant design patterns. Relevant formalisms may be sought following Guarino and Welty (2001) or Gangemi and Presutti (2009).

Morphogenesis
Morphogenesis in a KO could be seen as a process of developing the shape of a KO body. In fact, such a development is done by adding new assertions to the body and attributing them to the correct parts of the genome. This process could be implemented using the ontology instance migration technique (Davidovsky et al. 2011); however, the objective of morphogenesis differs from that of ontology instance migration. The task of the latter is to ensure correctness and completeness, that is, ideally, that all the assertions are properly aligned with and added to the target ontology ABox. Morphogenesis requires that only the assertions that fit well to the TBox of the target ontology are consumed for shaping it out. Those below the fitness threshold are excreted. If, for example, a mutagen perceived by a KO is the one from our Boeing example presented in Figures 1.2 or 1.7, then the set of individual assertions will be*

{AllNipponAirways:Airline, B787-JA812A:EfficientNewPlane, Japan:Country, Boeing:PlaneMaker, New20YMarketForecastbyBoeing:MarketForecast, UnitedStates:Country, Old20YMarketForecastbyBoeing:MarketForecast}.    (1.1)

Let us now assume that the genome (TBox) of the KO contains only the concepts represented in Figure 1.2 as grey-shaded classes, {Airline, PlaneMaker, MarketForecast}, and the thick-line relationships, {seeksFor–soughtBy}. Then only the assertions from (1.1) attributed to these elements (AllNipponAirways:Airline, Boeing:PlaneMaker, New20YMarketForecastbyBoeing:MarketForecast, and Old20YMarketForecastbyBoeing:MarketForecast) could be consumed for morphogenesis by this KO, and the rest have to be excreted back to the environment.
Interestingly, the ratio of mutagen ABox consumption may be used as a good metric for a KO in deliberations about its resistance to mutations, its desire to migrate to a different environmental context, or its readiness to start seeking reproduction partners.

* The syntax for representing individual assertions is similar to the syntax in UML for compatibility with Figure 1.2: 〈assertion-name〉:〈concept-name〉.
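Under the simplifying assumption that a token's ABox is a mapping of individuals to concept names and that fitness to the TBox amounts to name equality (a real KO would align structures and generate hypotheses instead), the consume-or-excrete decision and the consumption-ratio metric can be sketched as follows:

```python
def morphogenesis(token_abox, genome_tbox):
    """Partition a knowledge token's ABox into assertions consumed for
    shaping the KO body (their concept is present in the genome TBox)
    and assertions excreted back to the environment."""
    consumed = {i: c for i, c in token_abox.items() if c in genome_tbox}
    excreted = {i: c for i, c in token_abox.items() if c not in genome_tbox}
    return consumed, excreted

# The assertion set (1.1) from the Boeing example:
token = {
    "AllNipponAirways": "Airline",
    "B787-JA812A": "EfficientNewPlane",
    "Japan": "Country",
    "Boeing": "PlaneMaker",
    "New20YMarketForecastbyBoeing": "MarketForecast",
    "UnitedStates": "Country",
    "Old20YMarketForecastbyBoeing": "MarketForecast",
}
genome = {"Airline", "PlaneMaker", "MarketForecast"}  # grey-shaded TBox
consumed, excreted = morphogenesis(token, genome)
consumption_ratio = len(consumed) / len(token)  # 4 of 7 assertions consumed
```

A low ratio over many tokens would signal the KO to migrate or to seek recombination partners, as discussed above.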
Another important case in a morphogenesis process is detecting a contradiction between a newly coming mutagenic assertion and an assertion that is already in the body of the KO. For example, let us assume that the body already comprises the property SalesVolume of the assertion named New20YMarketForecastbyBoeing with the value of 2.1 million. The value of the same property coming with the mutagen equals 4.5 million. So, the KO has to resolve this contradiction by either (i) deciding to reshape its body by accepting the new assertion and excreting the old one, or (ii) resisting and declining the change. Another possible behavior would be collecting and keeping at hand the incoming assertions until their dominance is proved by quantity. Dominance may be assessed using different metrics. For example, a relevant technique is offered by the Strength Value-Based Argumentation Framework (Isaac et al. 2008).
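A naive realization of the "keep incoming assertions until their dominance is proved by quantity" behavior is majority voting over the observed values. The class below is an illustrative stand-in for a proper strength-value-based argumentation metric:

```python
from collections import Counter

class PropertySlot:
    """Holds the currently accepted value of a property together with
    all observed candidate values, accepting a change only once a
    candidate outweighs the current value by sheer quantity."""

    def __init__(self, value):
        self.value = value
        self.observations = Counter([value])

    def observe(self, candidate):
        self.observations[candidate] += 1
        # Reshape the body (accept the new assertion, excrete the old
        # one) only once the newcomer dominates the current value.
        if self.observations[candidate] > self.observations[self.value]:
            self.value = candidate
        return self.value

sales = PropertySlot("2.1 million")  # value already in the KO body
first = sales.observe("4.5 million")   # first contradiction: resisted
second = sales.observe("4.5 million")  # repeated evidence: now dominant
```

The first contradictory observation only ties with the stored value and is resisted; the second makes the newcomer dominant and the body is reshaped.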

Mutation
Mutation of a KO could be understood as the change of its genome caused by the environmental influences (mutagenic factors) coming with the consumed knowledge tokens. Similar to biological evolution, a KO and its genome are resistant to mutagenic factors and do not change at once because of any incoming influence, but only because of those that cannot be ignored owing to their strength. Different genome elements may be differently resistant. Let us illustrate different aspects of mutation and resistance using our Boeing example. As depicted in Figure 1.9, the change of the AirPlaneMaker concept name (to PlaneMaker) in the genome did not happen, though a new assertion had been added to the body as a result of morphogenesis (Boeing: (PlaneMaker) AirPlaneMaker*). The reason the AirPlaneMaker concept resisted this mutation was that the assertions attributed to the concept of PlaneMaker were in the minority, so the mutagenic factor has not yet been strong enough. This mutation will have a better chance to occur if similar mutagenic factors continue to come in and the old assertions in the body of the KO die out because their lifetime periods come to an end. More generally, the more individual assertions are attributed to a genome element at a given point in time, the more resistant this genome element is to mutations.
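This resistance rule can be caricatured as a majority vote over the concept names preserved with the attributed assertions (the annotation "Boeing (PlaneMaker) : AirPlaneMaker" contributes a vote for PlaneMaker). The function and the voting logic are an illustrative sketch, not the chapter's prescribed mechanism:

```python
from collections import Counter

def maybe_rename(element_name, attribution_names):
    """Decide whether a genome element mutates (is renamed), given the
    concept names under which its attributed assertions arrived.  The
    element resists while its current name keeps the majority; old
    assertions expiring gradually weaken that resistance."""
    votes = Counter(attribution_names)
    challenger, strength = votes.most_common(1)[0]
    if challenger != element_name and strength > votes[element_name]:
        return challenger   # the mutagenic factor became strong enough
    return element_name     # mutation resisted

# Two old assertions still carry AirPlaneMaker, one arrived as PlaneMaker:
before = maybe_rename("AirPlaneMaker",
                      ["AirPlaneMaker", "AirPlaneMaker", "PlaneMaker"])
# After old assertions die out and similar mutagens keep coming in:
after = maybe_rename("AirPlaneMaker",
                     ["AirPlaneMaker", "PlaneMaker", "PlaneMaker"])
```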
In contrast to the AirPlaneMaker case, the mutations brought by the hikes–hikedBy and successorOf–predecessorOf object properties did happen (Figure 1.9) because the KO did not possess any (strong) argument to resist them. Indeed, there were no contradictory properties in either the genome or the body of the KO before it accepted the grey-shaded assertions as a result of morphogenesis.

* UML syntax is used as basic. The name of the class from the knowledge token is added in brackets before the name of the class to which the assertion is attributed in the KO body. This is done for keeping the information about the occurrences of a different name in the incoming knowledge tokens. These historical data may further be used for evaluating the strength of the mutagenic factor.
Not all the elements of an incoming knowledge token could be consumed by a KO. In our example (Figure 1.9), some of the structural elements (Airline, EfficientNewPlane, seeksFor–soughtBy) were

• Too different from the genome of this particular KO, so that the similarity factor was too low and the KO did not find any match to its TBox. Hence, the KO was not able to generate any replacement hypotheses, also called propositional substitutions (Ermolayev et al. 2005).
• Too isolated from the elements of the genome, having no properties relating them to the genome elements. Hence, the KO was not able to generate any merge hypotheses.

These unused elements are excreted (Figure 1.9) back to the environment as a knowledge token. This token may further be consumed by another KO with a different genome comprising matching elements. Such a KO may migrate from a different environmental context (e.g., Airlines Business).

[Figure 1.9 pictures a mutating KO: its genome (Country, AirPlaneMaker, MarketForecast, and the mutated hikes–hikedBy and successorOf–predecessorOf properties), its body with morphogeneses such as Boeing (PlaneMaker) : AirPlaneMaker, the consumed knowledge token, and the excreted knowledge token carrying the irrelevant elements Airline, EfficientNewPlane, and seeksFor–soughtBy.]

Figure 1.9
Mutation in an individual KO illustrated by our Boeing example.
Similar to morphogenesis, mutation may be regarded as a subproblem of ontology alignment. The focus is, however, slightly different. In contrast to morphogenesis, which was interpreted as a specific ontology instance migration problem, mutation affects the TBox and is therefore structural ontology alignment (Ermolayev and Davidovsky 2012). There is a solid body of related work in structural ontology alignment; agent-based approaches relevant to our context are surveyed, for example, in Ermolayev and Davidovsky (2012).
In addition to the requirements already mentioned above, the following features extending an ontology representation language are essential for coping with the mechanisms of mutation:

• The information about the attribution of a consumed assertion to a particular structural element in the knowledge token needs to be preserved for future use in possible mutations. An example is given in Figure 1.9: Boeing: (PlaneMaker) AirPlaneMaker. The name of the concept in the knowledge token (PlaneMaker) is preserved, and the assertion is attributed to the AirPlaneMaker concept in the genome.

Recombination and Reproduction


Like mutation, recombination is a mechanism of adapting KOs to environmental changes. Recombination involves a group of KOs belonging to one or several similar species with partially matching genomes. In contrast to mutation, recombination is triggered and performed differently. Mutation is invoked by external influences coming from the environment in the form of mutagens. Recombination is triggered by a conscious intention of a KO to make its genome more resistant and therefore better adapted to the environment in its current state. Conscious in this context means that a KO first analyzes the strength and adaptation of its genome, detects weak elements, and then reasons about the necessity of acquiring external reinforcements for these weak elements. Weaknesses may be detected by:

• Looking at the proportion of consumed and excreted parts in the perceived knowledge tokens, that is, reasoning about how healthy the food in the current environmental context is. If it is not healthy enough, new elements extending the genome may be desired, to increase consumption and decrease excretion.
• Looking at the resistance of the elements in the genome to mutations. If weaknesses are detected, then it may be concluded that the assertions required for making these structural elements stronger are either nonexistent in the environmental context or are not consumed. In the latter case, a structural reinforcement by acquiring new genome elements through recombination may be useful. In the former case (nonexistence), the KO may decide to move to a different environmental context.

Recombination of KOs as a mechanism may be implemented using several available technologies. Firstly, a KO needs to reason about the strengths and weaknesses of the elements in its genome. For this, in addition to the extra knowledge representation language features mentioned above, it needs a simple reasoning functionality (pictured in Figure 1.6 as Deliberation). Secondly, a KO requires a means for getting in contact with the other KOs and checking if they have similar intentions to recombine their genomes. For this, the available mechanisms for communication (e.g., Labrou et al. 1999; Labrou 2006), meaning negotiation (e.g., Davidovsky et al. 2012), and coalition formation (e.g., Rahwan 2007) could be relevant.
Reproduction is based on the recombination mechanism and goes further by combining replicas of the bodies of the KOs that take part in the recombination group, resulting in the production of a new KO. A KO may intend to reproduce itself because its lifetime period comes to an end or because of other individual or group stimuli that remain to be researched.

Populations of Knowledge Organisms


KOs may belong to different species, that is, groups of KOs that have similar genomes based on the same etalon carried by the EKO. KOs that share the same area of habitat (environmental context) form a population, which may comprise the representatives of several species. Environmental contexts may also overlap, so the KOs of different species have possibilities to interact. With respect to species and populations, the mechanisms of (i) migration, (ii) genetic drift, (iii) speciation, and (iv) breeding are of interest for evolving knowledge representations.
Migration is the movement of KOs from one environmental context to another for the different reasons mentioned in the "Knowledge Organisms, their Environments, and Features" section. Genetic drift is the change of genomes to a degree beyond the species tolerance (similarity) threshold, caused by the cumulative effect of a series of mutations, as explained in the "Knowledge Genome and Knowledge Body" section. The speciation effect occurs if genetic drift results in a distinct group of KOs capable of reproducing themselves with their recombined genomes.
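As a toy model of drift and speciation, genome similarity could be measured as set overlap (a real system would use structural ontology alignment), with a new species declared once enough genomes have drifted past the tolerance threshold in a similar direction. All names, thresholds, and the critical mass below are illustrative:

```python
def genome_similarity(genome_a, genome_b):
    """Jaccard similarity between two genomes viewed as sets of
    elements (concepts, properties, axioms); a crude stand-in for
    structural ontology alignment."""
    union = genome_a | genome_b
    return len(genome_a & genome_b) / len(union) if union else 1.0

def speciation_occurred(genomes, etalon, tolerance=0.5, critical_mass=3):
    """A new species may emerge when at least critical_mass individual
    genomes drift below the tolerance similarity to the etalon."""
    drifted = [g for g in genomes if genome_similarity(g, etalon) < tolerance]
    return len(drifted) >= critical_mass

etalon = {"PlaneMaker", "MarketForecast", "Airline", "seeksFor"}
# Three KOs whose genomes drifted in a similar direction:
drifted_group = [{"Airline", "Route", "Hub", "Alliance"} for _ in range(3)]
new_species = speciation_occurred(drifted_group, etalon)
```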
If knowledge evolves in a way similar to biological evolution, the outcome of this process would best fit the KOs' desire for environmental mimicry, but perhaps not the requirements of ontology users. Therefore, for ensuring human stakeholders' commitment to the ontology, it might be useful to keep the evolution process under control. For this, constraints, or restrictions in another form, may be introduced for relevant environmental contexts and fitness measurement functions so as to guide the evolution toward a desired goal. This artificial way of control over the natural evolutionary order of things may be regarded as breeding, a controlled process of sequencing desired mutations that causes the emergence of a species with the required genome features.

Fitness of Knowledge Organisms and Related Ontologies


It has been repeatedly stated in the discussion of the features of KOs in "Knowledge Organisms, their Environments, and Features" that they exhibit proactive behavior. One topical case is that a KO would rather migrate away from the current environmental context than continue consuming knowledge tokens which are not healthy for it in terms of structural similarity to its genome. It has also been mentioned that a KO may cooperate with other KOs to fulfill its evolutionary intentions. For instance, KOs may form cooperative groups for recombination or reproduction. They also interact with their EKOs for improving the etalon genome of the species. Another valid case, though not mentioned in "Knowledge Organisms, their Environments, and Features", would be if a certain knowledge token is available in the environment and two or more KOs approach it concurrently with an intention to consume it. If those KOs are cooperative, the token will be consumed by the one which needs it most, so that the overall "strength" of the species is increased. Otherwise, if the KOs are competitive, as often happens in nature, the strongest KO will get the token. All these cases require a quantification of the strength, or fitness, of KOs and knowledge tokens. Fitness is, in fact, a complex metric having several important facets.
Firstly, we summarize what the fitness of a KO means. We outline that this fitness is inseparable from (in fact, symmetric to) the fitness of the knowledge tokens that KOs consume from and excrete back to their environmental contexts. Then, we describe several factors which contribute to fitness. Finally, we discuss how several dimensions of fitness could be used to compare different KOs.
To start our deliberations about fitness, we have to map the high-level understanding of this metric to the requirements of Big Data processing as presented in the "Motivation and Unsolved Issues" and "State of Technology, Research, and Development in Big Data Computing" sections in the form of the processing stack (Figures 1.3 through 1.5). The grand objective of a Big Data computing system or infrastructure is providing a capability for data analysis with balanced effectiveness and efficiency. In particular, this capability subsumes facilitating decision-making and classification, providing adequate inputs to software applications, etc. An evolving knowledge ecosystem, comprising environmental contexts populated with KOs, is introduced in the semantics processing layer of the overall processing stack. The aim of introducing the ecosystem is to ensure a seamless and balanced connection between a user who operates the system at the upper layers and the lower layers that provide data.

Ontologies are the "blood and flesh" of the KOs and the whole ecosystem, as they are both the code registering a desired evolutionary change and the result of this evolution. From the data-processing viewpoint, the ontologies are consensual knowledge representations that facilitate improving data integration, transformation, and interoperability between the processing nodes in the infrastructure. A seamless connection through the layers of the processing stack is facilitated by the way ontologies are created and changed. As already mentioned in the introduction of the "Knowledge Self-Management and Refinement through Evolution" section, ontologies are traditionally designed beforehand and further populated by assertions taken from the source data. In our evolving ecosystem, ontologies evolve in parallel to data processing. Moreover, the changes in ontologies are caused by the mutagens brought by the incoming data. The knowledge extraction subsystem (Figure 1.7) transforms units of data into knowledge tokens. These in turn are sown in a corresponding environmental context by a contextualization subsystem and further consumed by KOs. KOs may change their bodies or even mutate due to the changes brought by consumed mutagenic knowledge tokens. The changes in the KOs are in fact the changes in the ontologies they carry. So, ontologies change seamlessly and naturally in a way that best suits the substance brought in by the data. For assessing this change, judgments about the value and appropriateness of ontologies in time are important. Those should, however, be formulated accounting for the fact that an ontology is able to self-evolve.
The degree to which an ontology is reused is one more important characteristic to be taken into account. Reuse means that data in multiple places refer to this ontology; combined with interoperability, this implies that data about similar things are described using the same ontological fragments. When looking at an evolving KO, having a perfect ontology would mean that if new knowledge tokens appear in the environmental contexts of an organism, the organism can integrate all the assertions in the tokens, that is, without a need to excrete some parts of the consumed knowledge tokens back to the environment. That is to say, the ontology which was internal to the KO before the token was consumed was already prepared for the integration of the new token. Now, one could turn the viewpoint around by saying that the information described in the token was already described in the ontology which the KO had, and thus that the ontology was reused in one more place. This increases the value, that is, the fitness, of the ontology maintained by the KO. Using similar argumentation, we can conclude that if a KO needs to excrete a consumed knowledge token, the ontology fits worse to describing the fragment of data to which the excreted token is attributed. Thus, in conclusion, we could say that the fitness of a KO is directly dependent on the proportion between the parts of knowledge tokens which it (a) is able to consume for morphogenesis and possibly mutation, versus (b) needs to excrete back to the environment. Additionally, the age of the assertions which build up the current knowledge body of a KO influences its quality. If the proportion of very young assertions in the body is high, the KO might not be resistant to stochastic changes, which is not healthy. Otherwise, if only long-living assertions form the body, it means that the KO is either in a wrong context or too resistant to mutagens. Both are bad, as no new information is added, the KO ignores changes, and hence the ontology it carries may become irrelevant. Therefore, a good mix of young and old assertions in the body of a KO indicates high fitness: the KO's knowledge is overall valid and evolves appropriately.
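Combining the two signals just discussed, a toy fitness function might average the consumption share with an age-balance term that peaks when roughly half of the body is young. The equal weighting and the "young" cut-off are arbitrary choices for illustration, not values prescribed by the chapter:

```python
def ko_fitness(consumed, excreted, assertion_ages, now, young_age=7.0):
    """Toy fitness of a KO in [0, 1]: (i) share of token assertions it
    could consume rather than excrete, and (ii) balance between young
    and old assertions in its body (birth times in assertion_ages)."""
    total = consumed + excreted
    consumption = consumed / total if total else 0.0
    young = sum(1 for born in assertion_ages if now - born < young_age)
    young_share = young / len(assertion_ages) if assertion_ages else 0.0
    age_balance = 1.0 - abs(young_share - 0.5) * 2.0  # peaks at a 50/50 mix
    return 0.5 * consumption + 0.5 * age_balance

# A KO that consumed 4 of 7 assertions and holds 3 young + 3 old ones:
fitness = ko_fitness(4, 3, assertion_ages=[0, 1, 2, 95, 96, 97], now=100)
```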
Of course, stating that fitness depends only on the numbers of used and excreted assertions is an oversimplification. Indeed, the incoming knowledge tokens that carry assertions may be very different. For instance, the knowledge token in our Boeing example contains several concepts and properties in its TBox: a Plane, a PlaneMaker, a MarketForecast, an Airline, a Country, SalesVolume, seeksFor–soughtBy, etc. Also, some individuals attributed to these TBox elements are given in the ABox: UnitedStates, Boeing, New20YMarketForecastByBoeing, 4.5 trillion, etc. One can imagine a less complex knowledge token which contains less information. In addition to size and complexity, a token also has other properties which are important to consider. One is the source where the token originates from. A token can be produced by knowledge extraction from a given channel or can be excreted by a KO. When the token is extracted from a channel, its value depends on the quality of the channel, relative to the quality of the other channels in the system (see also the context of origin in the "Contextualizing" section). The quality of knowledge extraction is important as well, though random errors could be mitigated by statistical means. Further, a token could be attributed to a number of environmental contexts. A context is important, that is, adds more value to a token in that context, if there are a lot of knowledge tokens in the context or, more precisely, if many tokens have appeared in the context recently. Consequently, a token becomes less valuable along its lifetime in the environment.
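The token-value factors above (channel quality, context activity, aging) can be combined in many ways; one minimal multiplicative sketch with exponential aging, where every factor and the half-life are purely illustrative assumptions, is:

```python
import math

def token_value(channel_quality, context_recent_tokens, age, half_life=10.0):
    """Toy value of a knowledge token: scaled by the quality of its
    source channel, boosted by recent activity in its environmental
    context, and decaying exponentially with the token's age."""
    context_weight = math.log1p(context_recent_tokens)  # diminishing boost
    decay = 0.5 ** (age / half_life)                    # aging in the environment
    return channel_quality * context_weight * decay

fresh = token_value(0.9, context_recent_tokens=50, age=0)
stale = token_value(0.9, context_recent_tokens=50, age=20)
# After two half-lives the same token is worth a quarter of its fresh value.
```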
Until now, we have been looking at the different fitness, value, and quality factors in isolation. The problem is, however, that there is no straightforward way to integrate these different factors. For this, an approach that addresses the problem of assessing the quality of an ontology as a dynamic optimization problem (Cochez and Terziyan 2012) may be relevant.

Some Conclusions
For all those who use or process Big Data, a good mental picture of the world dissolved in data tokens may be worth petabytes of raw information and save weeks of analytic work. Data emerge reflecting a change in the world. Hence, Big Data is a fine-grained reflection of the changes around us. Knowledge extracted from these data in an appropriate and timely way is the essence of an adequate understanding of the change in the world. In this chapter, we provided evidence that numerous challenges stand in the way of understanding the sense, the trends, dissolved in the petabytes of Big Data, that is, of extracting its semantics for further use in analytics. Among those challenges, we have chosen as our focus the problem of balancing effectiveness and efficiency in understanding Big Data. For better explaining our motivation and giving the reader a key that helps follow how our premises are transformed into conclusions, we offered a simple walkthrough example of a news token.
We began the analysis of Big Data Computing by looking at how the phenomenon influences and changes industrial landscapes. This overview helped us figure out that the demand in industries for effective and efficient use of Big Data, if properly understood, is enormous. However, this demand is not yet fully satisfied by the state-of-the-art technologies and methodologies. We then looked at current trends in research and development aimed at narrowing the gaps between the actual demand and the state of the art. The analysis of the current state of research activities resulted in pointing out the shortcomings and offering an approach that may help understand Big Data in a way that balances effectiveness and efficiency.
The major recommendations we elaborated for achieving the balance are: (i) devise approaches that intelligently combine top-down and bottom-up processing of data semantics by exploiting "3F + 3Co" in dynamics, at run time; (ii) use a natural, incremental, and evolutionary way of processing Big Data and its semantics instead of following a mechanistic approach to scalability.
Inspired by the harmony and beauty of biological evolution, we further presented our vision of how these high-level recommendations may be approached. The "Scaling with a Traditional Database" section offered a review of possible ways to solve the scalability problem at the data-processing level. The "Knowledge Self-Management and Refinement through Evolution" section presented a conceptual-level framework for building an evolving ecosystem of environmental contexts with knowledge tokens and different species of KOs that populate environmental contexts and collect knowledge tokens for nutrition. The genomes and bodies of these KOs are ontologies describing the corresponding environmental contexts. These ontologies evolve in line with the evolution of the KOs. Hence they reflect the evolution of our understanding of Big Data by collecting the refinements of our mental picture of the change in the world. Finally, we found that such an evolutionary approach to building knowledge representations naturally allows assuring the fitness of knowledge representations, as the fitness of the corresponding KOs to the environmental contexts they inhabit.
We also found that the major technological components for building such evolving knowledge ecosystems are already in place and could be effectively used, if refined and combined as outlined in the "Knowledge Self-Management and Refinement through Evolution" section.

Acknowledgments
This work was supported in part by the "Cloud Software Program" managed by TiViT Oy and the Finnish Funding Agency for Technology and Innovation (TEKES).

References
Abadi, D. J., D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker,
N. Tatbul, and S. Zdonik. 2003. Aurora: A new model and architecture for data
stream management. VLDB Journal 12(2): 120–139.
Anderson, C. 2008. The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine 16:07 (June 23). http://www.wired.com/science/discoveries/magazine/16-07/pb_theory.
Ankolekar, A., M. Krotzsch, T. Tran, and D. Vrandecic. 2007. The two cultures:
Mashing up Web 2.0 and the Semantic Web. In Proc Sixteenth Int Conf on World
Wide Web (WWW’07), 825–834. New York: ACM.
Berry, D. 2011. The computational turn: Thinking about the digital humanities. Culture
Machine 12 (July 11). http://www.culturemachine.net/index.php/cm/article/
view/440/470.
Beyer, M. A., A. Lapkin, N. Gall, D. Feinberg, and V. T. Sribar. 2011. ‘Big Data’ is only
the beginning of extreme information management. Gartner Inc. (April). http://
www.gartner.com/id=1622715 (accessed August 30, 2012).
Bizer, C., T. Heath, and T. Berners-Lee. 2009. Linked data—The story so far. International
Journal on Semantic Web and Information Systems 5(3): 1–22.
Bollier, D. 2010. The promise and peril of big data. Report, Eighteenth Annual
Aspen Institute Roundtable on Information Technology, the Aspen Institute.
http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_
Promise_and_Peril_of_Big_Data.pdf (accessed August 30, 2012).
Bowker, G. C. 2005. Memory Practices in the Sciences. Cambridge, MA: MIT Press.
Boyd, D. and K. Crawford. 2012. Critical questions for big data. Information, Communication
& Society 15(5): 662–679.
Broekstra, J., A. Kampman, and F. van Harmelen. 2002. Sesame: A generic architecture
for storing and querying RDF and RDF schema. In The Semantic Web—ISWC
2002, eds. I. Horrocks and J. Hendler, 54–68. Berlin, Heidelberg: Springer-Verlag,
LNCS 2342.
Cai, M. and M. Frank. 2004. RDFPeers: A scalable distributed RDF repository based
on a structured peer-to-peer network. In Proc Thirteenth Int Conf World Wide
Web (WWW’04), 650–657. New York: ACM.
Capgemini. 2012. The deciding factor: Big data & decision making. Report. http://
www.capgemini.com/services-and-solutions/technology/business-informa-
tion-management/the-deciding-factor/ (accessed August 30, 2012).
Chang, F., J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. E. Gruber. 2008. Bigtable: A distributed storage
Toward Evolving Knowledge Ecosystems for Big Data Understanding 51

system for structured data. ACM Transactions on Computer Systems 26(2):


a­ rticle 4.
Cochez, M. and V. Terziyan. 2012. Quality of an ontology as a dynamic optimisation
problem. In Proc Eighth Int Conf ICTERI 2012, eds. V. Ermolayev et al., 249–256.
CEUR-WS vol. 848. http://ceur-ws.org/Vol-848/ICTERI-2012-CEUR-WS-DEIS-
paper-1-p-249-256.pdf.
Collins, A. M. and E. F. Loftus. 1975. A spreading-activation theory of semantic pro-
cessing. Psychological Review 82(6): 407–428.
Cusumano, M. 2010. Cloud computing and SaaS as new computing platforms.
Communications of the ACM 53(4): 27–29.
Darwin, C. 1859. On the Origin of Species by Means of Natural Selection, or the Preservation
of Favoured Races in the Struggle for Life. London: John Murrey.
Davidovsky, M., V. Ermolayev, and V. Tolok. 2011. Instance migration between ontol-
ogies having structural differences. International Journal on Artificial Intelligence
Tools 20(6): 1127–1156.
Davidovsky, M., V. Ermolayev, and V. Tolok. 2012. Agent-based implementation for
the discovery of structural difference in OWL DL ontologies. In Proc. Fourth Int
United Information Systems Conf (UNISCON 2012), eds. H. C. Mayr, A. Ginige,
and S. Liddle, Berlin, Heidelberg: Springer-Verlag, LNBIP 137.
Dean, J. and S. Ghemawat. 2008. MapReduce: Simplified data processing on large
clusters. Communications of the ACM 51(1): 107–113.
Dean, D. and C. Webb. 2011. Recovering from information overload. McKinsey
Quarterly. http://www.mckinseyquarterly.com/Recovering_from_information_
overload_2735 (accessed October 8, 2012).
DeCandia, G., D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin,
S.  Sivasubramanian, P. Vosshall, and W. Vogels. 2007. Dynamo: Amazon’s
highly available key-value store. In 21st ACM Symposium on Operating Systems
Principles, eds. T. C. Bressoud and M. Frans Kaashoek, 205–220. New York: ACM.
Dickinson, I. and M. Wooldridge. 2003. Towards practical reasoning agents for the
semantic web. In Proc. of the Second International Joint Conference on Autonomous
Agents and Multiagent Systems, 827–834. New York: ACM.
Driscoll, M. 2011. Building data startups: Fast, big, and focused. O’Reilly Radar
(9).  http://radar.oreilly.com/2011/08/building-data-startups.html (accessed
October 8, 2012).
Ermolayev, V. and M. Davidovsky. 2012. Agent-based ontology alignment: Basics,
applications, theoretical foundations, and demonstration. In Proc. Int Conf on
Web Intelligence, Mining and Semantics (WIMS 2012), eds. D. Dan Burdescu, R.
Akerkar, and C. Badica, 11–22. New York: ACM.
Ermolayev, V., N. Keberle, O. Kononenko, S. Plaksin, and V. Terziyan. 2004. Towards
a framework for agent-enabled semantic web service composition. International
Journal of Web Services Research 1(3): 63–87.
Ermolayev, V., N. Keberle, W.-E. Matzke, and V. Vladimirov. 2005. A strategy for auto-
mated meaning negotiation in distributed information retrieval. In Proc 4th Int
Semantic Web Conference (ISWC’05), eds. Y. Gil et al., 201–215. Berlin, Heidelberg:
Springer-Verlag, LNCS 3729.
Ermolayev, V., N. Keberle, and W.-E. Matzke. 2008. An ontology of environments,
events, and happenings, computer software and applications, 2008. COMPSAC
'08. 32nd Annual IEEE International, pp. 539, 546, July 28, 2008–Aug. 1, 2008.
doi: 10.1109/COMPSAC.2008.141
52 Big Data Computing

Ermolayev, V., C. Ruiz, M. Tilly, E. Jentzsch, J.-M. Gomez-Perez, and W.-E. Matzke.
2010. A context model for knowledge workers. In Proc Second Workshop on
Content, Information, and Ontologies (CIAO 2010), eds. V. Ermolayev, J.-M.
Gomez-Perez, P. Haase, and P. Warren, CEUR-WS, vol. 626. http://ceur-ws.
org/Vol-626/regular2.pdf (online).
Euzenat, J. and P. Shvaiko. 2007. Ontology Matching. Berlin, Heidelberg: Springer-Verlag.
Fan, W., A. Bifet, Q. Yang, and P. Yu. 2012a. Foreword. In Proc First Int Workshop on Big
Data, Streams, and Heterogeneous Source Mining: Algorithms, Systems, Programming
Models and Applications, eds. W. Fan, A. Bifet, Q. Yang, and P. Yu, New York: ACM.
Fan, J., A. Kalyanpur, D. C. Gondek, and D. A. Ferrucci. 2012b. Automatic knowledge
extraction from documents. IBM Journal of Research and Development 56(3.4):
5:1–5:10.
Fensel, D., F. van Harmelen, B. Andersson, P. Brennan, H. Cunningham, E. Della
Valle, F. Fischer et al. 2008. Towards LarKC: A platform for web-scale reason-
ing, Semantic Computing, 2008 IEEE International Conference on, pp. 524, 529,
4–7 Aug. 2008. doi: 10.1109/ICSC.2008.41.
Fisher, D., R. DeLine, M. Czerwinski, and S. Drucker. 2012. Interactions with big data
analytics. Interactions 19(3):50–59.
Gangemi, A. and V. Presutti. 2009. Ontology design patterns. In Handbook on
Ontologies, eds. S. Staab and R. Studer, 221–243. Berlin, Heidelberg: Springer-
Verlag, International Handbooks on Information Systems.
Ghemawat, S., H. Gobioff, and S.-T. Leung. 2003. The Google file system. In Proc
Nineteenth ACM Symposium on Operating Systems Principles (SOSP’03), 29–43.
New York: ACM.
Golab, L. and M. Tamer Ozsu. 2003. Issues in data stream management. SIGMOD
Record 32(2): 5–14.
Gordon, A. 2005. Privacy and ubiquitous network societies. In Workshop on ITU
Ubiquitous Network Societies, 6–15.
Greller, W. 2012. Reflections on the knowledge society. http://wgreller.­wordpress.
com/2010/11/03/big-data-isnt-big-knowledge-its-big-business/  (accessed
August 20, 2012).
Gu, Y. and R. L. Grossman. 2009. Sector and sphere: The design and implementation
of a high-performance data cloud. Philosophical Transactions of the Royal Society
367(1897): 2429–2445.
Guarino, N. and C. Welty. 2001. Supporting ontological analysis of taxonomic rela-
tionships. Data and Knowledge Engineering 39(1): 51–74.
Guéret, C., E. Oren, S. Schlobach, and M. Schut. 2008. An evolutionary perspective
on approximate RDF query answering. In Proc Int Conf on Scalable Uncertainty
Management, eds. S. Greco and T. Lukasiewicz, 215–228. Berlin, Heidelberg:
Springer-Verlag, LNAI 5291.
He, B., M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. 2010. Comet: Batched
stream processing for data intensive distributed computing, In Proc First ACM
symposium on Cloud Computing (SoCC’10), 63–74. New York: ACM.
Hepp, M. 2007. Possible ontologies: How reality constrains the development of rel-
evant ontologies. IEEE Internet Computing 11(1): 90–96.
Hogan, A., J. Z. Pan, A. Polleres, and Y. Ren. 2011. Scalable OWL 2 reasoning for linked
data. In Lecture Notes for the Reasoning Web Summer School, Galway, Ireland
(August). http://aidanhogan.com/docs/rw_2011.pdf (accessed October 18,
2012).
Toward Evolving Knowledge Ecosystems for Big Data Understanding 53

Isaac, A., C. Trojahn, S. Wang, and P. Quaresma. 2008. Using quantitative aspects
of alignment generation for argumentation on mappings. In Proc ISWC’08
Workshop on Ontology Matching, ed. P. Shvaiko, J. Euzenat, F. Giunchiglia, and
H. Stuckenschmidt, CEUR-WS Vol-431. http://ceur-ws.org/Vol-431/om2008_
Tpaper5.pdf (online).
Ishai, Y., E. Kushilevitz, R. Ostrovsky, and A. Sahai. 2009. Extracting correlations,
Foundations of Computer Science, 2009. FOCS '09. 50th Annual IEEE Symposium
on, pp. 261, 270, 25–27 Oct. 2009. doi: 10.1109/FOCS.2009.56.
Joseph, A. 2012. A Berkeley view of big data. Closing keynote of Eduserv Symposium
2012: Big Data, Big Deal? http://www.eduserv.org.uk/newsandevents/
events/2012/symposium/closing-keynote (accessed October 8, 2012).
Keberle, N. 2009. Temporal classes and OWL. In Proc Sixth Int Workshop on OWL:
Experiences and Directions (OWLED 2009), eds. R. Hoekstra and P. F. Patel-
Schneider, CEUR-WS, vol 529. http://ceur-ws.org/Vol-529/owled2009_sub-
mission_27.pdf (online).
Kendall, E., R. Bell, R. Burkhart, M. Dutra, and E. Wallace. 2009. Towards a graphical
notation for OWL 2. In Proc Sixth Int Workshop on OWL: Experiences and Directions
(OWLED 2009), eds. R. Hoekstra and P. F. Patel-Schneider, CEUR-WS, vol 529.
http://ceur-ws.org/Vol-529/owled2009_submission_47.pdf (online).
Klinov, P., C. del Vescovo, and T. Schneider. 2012. Incrementally updateable and
persistent decomposition of OWL ontologies. In Proc OWL: Experiences and
Directions Workshop, ed. P. Klinov and M. Horridge, CEUR-WS, vol 849. http://
ceur-ws.org/Vol-849/paper_7.pdf (online).
Kontchakov, R., C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. 2010. The com-
bined approach to query answering in DL-Lite. In Proc Twelfth Int Conf on the
Principles of Knowledge Representation and Reasoning (KR 2010), eds. F. Lin and U.
Sattler, 247–257. North America: AAAI.
Knuth, D. E. 1998. The Art of Computer Programming. Volume 3: Sorting and Searching.
Second Edition, Reading, MA: Addison-Wesley.
Labrou, Y. 2006. Standardizing agent communication. In Multi-Agent Systems and
Applications, eds. M. Luck, V. Marik, O. Stepankova, and R. Trappl, 74–97. Berlin,
Heidelberg: Springer-Verlag, LNCS 2086.
Labrou, Y., T. Finin, and Y. Peng. 1999. Agent communication languages: The current
landscape. IEEE Intelligent Systems 14(2): 45–52.
Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure.
Communications of the ACM 38(11): 33–38.
Lin, J. and C. Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan &
Claypool Synthesis Lectures on Human Language Technologies. http://lintool.
github.com/MapReduceAlgorithms/MapReduce-book-final.pdf.
Manyika, J., M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Hung Byers.
2011. Big data: The next frontier for innovation, competition, and productivity.
McKinsey Global Institute (May). http://www.mckinsey.com/insights/mgi/
research/technology_and_innovation/big_data_the_next_frontier_for_inno-
vation (accessed October 8, 2012).
McGlothlin, J. P. and L. Khan. 2010. Materializing inferred and uncertain knowledge
in RDF datasets. In Proc Twenty-Fourth AAAI Conference on Artificial Intelligence
(AAAI-10), 1951–1952. North America: AAAI.
Mills, P. 2011. Efficient statistical classification of satellite measurements. International
Journal of Remote Sensing 32(21): 6109–6132.
54 Big Data Computing

Mitchell, I. and M. Wilson. 2012. Linked Data. Connecting and Exploiting Big Data.
Fujitsu White Paper (March). http://www.fujitsu.com/uk/Images/Linked-
data-connecting-and-exploiting-big-data-(v1.0).pdf.
Nardi, D. and R. J. Brachman. 2007. An introduction to description logics. In
The Description Logic Handbook, eds. F. Baader, D. Calvanese, D. L. McGuinness,
D. Nardi, and P. F. Patel-Schneider. New York: Cambridge University Press.
Nemani, R. R. and R. Konda. 2009. A framework for data quality in data warehousing.
In Information Systems: Modeling, Development, and Integration, eds. J. Yang, A.
Ginige, H. C. Mayr, and R.-D. Kutsche, 292–297. Berlin, Heidelberg: Springer-
Verlag, LNBIP 20.
Olston, C. 2012. Programming and debugging large scale data processing workflows.
In First Int Workshop on Hot Topics in Cloud Data Processing (HotCDP’12),
Switzerland.
Oren, E., S. Kotoulas, G. Anadiotis, R. Siebes, A. ten Teije, and F. van Harmelen. 2009.
Marvin: Distributed reasoning over large-scale Semantic Web data. Journal of
Web Semantics 7(4): 305–316.
Ponniah, P. 2010. Data Warehousing Fundamentals for IT Professionals. Hoboken, NJ:
John Wiley & Sons.
Puuronen, S., V. Terziyan, and A. Tsymbal. 1999. A dynamic integration algorithm
for an ensemble of classifiers. In Foundations of Intelligent Systems: Eleventh Int
Symposium ISMIS’99, eds. Z.W. Ras and A. Skowron, 592–600. Berlin, Heidelberg:
Springer-Verlag, LNAI 1609.
Quillian, M. R. 1967. Word concepts: A theory and simulation of some basic semantic
capabilities. Behavioral Science 12(5): 410–430.
Quillian, M. R. 1969. The teachable language comprehender: A simulation program
and theory of language. Communications of the ACM 12(8): 459–476.
Rahwan, T. 2007. Algorithms for coalition formation in multi-agent systems. PhD
diss., University of Southampton. http://users.ecs.soton.ac.uk/nrj/download-
files/lesser-award/rahwan-thesis.pdf (accessed October 8, 2012).
Rimal, B. P., C. Eunmi, and I. Lumb. 2009. A taxonomy and survey of cloud comput-
ing systems. In Proc Fifth Int Joint Conf on INC, IMS and IDC, 44–51. Washington,
DC: IEEE CS Press.
Roy, G., L. Hyunyoung, J. L. Welch, Z. Yuan, V. Pandey, and D. Thurston. 2009. A
distributed pool architecture for genetic algorithms, Evolutionary Computation,
2009. CEC '09. IEEE Congress on, pp. 1177, 1184, 18–21 May 2009. doi: 10.1109/
CEC.2009.4983079
Sakr, S., A. Liu, D.M. Batista, and M. Alomari. 2011. A survey of large scale data
management approaches in cloud environments. IEEE Communications Society
Surveys & Tutorials 13(3): 311–336.
Salehi, A. 2010. Low Latency, High Performance Data Stream Processing: Systems
Architecture. Algorithms and Implementation. Saarbrücken: VDM Verlag.
Shvachko, K., K. Hairong, S. Radia, R. Chansler. 2010. The Hadoop distributed file
system, Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium
on, pp.1,10, 3–7 May 2010. doi: 10.1109/MSST.2010.5496972.
Smith, B. 2012. Big data that might benefit from ontology technology, but why this
usually fails. In Ontology Summit 2012, Track 3 Challenge: Ontology and Big
Data. http://ontolog.cim3.net/file/work/OntologySummit2012/2012-02-09_
BigDataChallenge-I-II/Ontology-for-Big-Data—BarrySmith_20120209.pdf
(accessed October 8, 2012).
Toward Evolving Knowledge Ecosystems for Big Data Understanding 55

Tatarintseva, O., V. Ermolayev, and A. Fensel. 2011. Is your Ontology a burden or a


Gem?—Towards Xtreme Ontology engineering. In Proc Seventh Int Conf ICTERI
2011, eds. V. Ermolayev et al., 65–81. CEUR-WS, vol. 716. http://ceur-ws.org/
Vol-716/ICTERI-2011-CEUR-WS-paper-4-p-65–81.pdf (online).
Terziyan, V. 2001. Dynamic integration of virtual predictors. In Proc Int ICSC Congress
on Computational Intelligence: Methods and Applications (CIMA’2001), eds. L. I.
Kuncheva et al., 463–469. Canada: ICSC Academic Press.
Terziyan, V. 2007. Predictive and contextual feature separation for Bayesian meta-
networks. In Proc KES-2007/WIRN-2007, ed. B. Apolloni et al., 634–644. Berlin,
Heidelberg: Springer-Verlag, LNAI 4694.
Terziyan, V. and O. Kaykova. 2012. From linked data and business intelligence to
executable reality. International Journal on Advances in Intelligent Systems 5(1–2):
194–208.
Terziyan, V., A. Tsymbal, and S. Puuronen. 1998. The decision support system for tele-
medicine based on multiple expertise. International Journal of Medical Informatics
49(2): 217–229.
Thomason, R. H. 1998. Representing and reasoning with context. In Proc Int Conf on
Artificial Intelligence and Symbolic Computation (AISC 1998), eds. J. Calmet and J.
Plaza, 29–41. Berlin, Heidelberg: Springer-Verlag, LNAI 1476.
Thusoo, A., Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. S. Sarma, R. Murthy, and
H. Liu. 2010. Data warehousing and analytics infrastructure at Facebook. In
Proc 2010 ACM SIGMOD Int Conf on Management of Data, 1013–1020. New York:
ACM.
Tsangaris, M. M., G. Kakaletris, H. Kllapi, G. Papanikos, F. Pentaris, P. Polydoras, E.
Sitaridi, V. Stoumpos, and Y. E. Ioannidis. 2009. Dataflow processing and opti-
mization on grid and cloud infrastructures. IEEE Data Engineering Bulletin 32(1):
67–74.
Urbani, J., S. Kotoulas, E. Oren, and F. van Harmelen. 2009. Scalable distributed rea-
soning using MapReduce. In Proc Eighth Int Semantic Web Conf (ISWC’09),
eds. A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta,
and K. Thirunarayan, 634–649. Berlin, Heidelberg: Springer-Verlag.
W3C. 2009. OWL 2 web ontology language profiles. W3C Recommendation (October).
http://www.w3.org/TR/owl2-profiles/.
Weinberger, D. 2012. Too Big to know. Rethinking Knowledge now that the Facts aren’t the
Facts, Experts are Everywhere, and the Smartest Person in the Room is the Room. First
Edition. New York, NY: Basic Books.
Wielemaker, J., G. Schreiber, and B. Wielinga. 2003. Prolog-based infrastructure
for RDF: Scalability and performance. In The Semantic Web—ISWC 2003, eds.
D. Fensel, K. Sycara, and J. Mylopoulos, 644–658. Berlin, Heidelberg: Springer-
Verlag, LNCS 2870.
Wooldridge, M. and N. R. Jennings. 1995. Intelligent agents: Theory and practice. The
Knowledge Engineering Review 10(2): 115–152.
This page intentionally left blank
2
Taxonomy and Review of Big Data Solutions Navigation

Pierfrancesco Bellini, Mariano di Claudio, Paolo Nesi, and Nadia Rauch

Contents
Introduction............................................................................................................ 58
Main Requirements and Features of Big Data Solutions..................................65
Infrastructral and Architectural Aspects........................................................65
Scalability.......................................................................................................65
High Availability.......................................................................................... 67
Computational Process Management........................................................ 68
Workflow Automation................................................................................. 68
Cloud Computing........................................................................................ 69
Self-Healing................................................................................................... 70
Data Management Aspects.............................................................................. 70
Database Size................................................................................................ 71
Data Model.................................................................................................... 71
Resources....................................................................................................... 72
Data Organization........................................................................................ 73
Data Access for Rendering.......................................................................... 74
Data Security and Privacy........................................................................... 74
Data Analytics Aspects..................................................................................... 75
Data Mining/Ingestion................................................................................ 76
Data Access for Computing........................................................................77
Overview of Big Data Solutions........................................................................... 78
Couchbase...................................................................................................... 79
eXist................................................................................................................ 82
Google Map-Reduce..................................................................................... 82
Hadoop..........................................................................................................83
Hbase..............................................................................................................83
Hive................................................................................................................84
MonetDB........................................................................................................84
MongoDB.......................................................................................................85
Objectivity.....................................................................................................85
OpenQM........................................................................................................ 86
RDF-3X........................................................................................................... 86

Comparison and Analysis of Architectural Features........................................ 87
Application Domains Comparison...................................................................... 87
Conclusions............................................................................................................. 96
References................................................................................................................ 97

Introduction
Although the management of huge and growing volumes of data has been a
challenge for many years, no long-term solution has been found so far.
The term “Big Data” initially referred to volumes of data whose size lies
beyond the capabilities of current database technologies; consequently,
“Big Data” problems were those combining a large volume of data with the
need to treat them in a short time. Once it is established that data have
to be collected and stored at an impressive rate, it becomes clear that
the main challenge is not only their storage and management, but also
their analysis, the extraction of meaningful values, and the deductions
and actions they enable in reality. Big Data problems were mostly related
to the presence of unstructured data, that is, information that either
does not have a default schema/template or does not fit well into
relational tables; analysis techniques for unstructured data are therefore
necessary to address these problems.
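As a small illustration (using invented records, not data from any real system), consider two documents arriving on the same feed: they share no fixed schema/template, which is why they do not map well onto a single relational table.

```python
# Two hypothetical records from the same feed. A single relational table
# holding both would need many NULL columns or continuous schema changes.
event_a = {"user": "u1", "type": "post",
           "text": "hello", "tags": ["greeting"]}
event_b = {"user": "u2", "type": "photo",
           "resolution": [1920, 1080], "camera": {"make": "X", "iso": 200}}

shared = set(event_a) & set(event_b)    # only the keys common to both records
all_keys = set(event_a) | set(event_b)  # grows with every new record shape
```

Schema-less stores and dedicated analysis techniques, several of which are reviewed later in this chapter, are typically adopted precisely because the union of fields keeps growing in this way.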
Recently, Big Data problems have been characterized by a combination of
the so-called 3Vs: volume, velocity, and variety; a fourth V has since
been added: variability. In essence, a large volume of information is
produced every day, and these data need sustainable access, processing,
and preservation that keep pace with the velocity of their arrival; the
management of large volumes of data is therefore not the only problem.
Moreover, the variety of data, metadata, access rights and associated
computing, formats, semantics, and software tools for visualization, and
the variability in structure and data models, significantly increase the
level of complexity of these problems. The first V, volume, describes the
large amount of data generated by individuals, groups, and organizations.
The volume of data being stored today is exploding. For example, in the
year 2000 about 800,000 petabytes of data were generated and stored in the
world (Eaton et al., 2012), and experts estimate that about 35 zettabytes
of data will be produced in the year 2020. The second V, velocity, refers
to the speed at which Big Data are collected, processed, and elaborated,
often as a constant flow of massive data that is impossible to process
with traditional solutions. For this reason, it is important to consider
not only “where” the data are stored, but also “how” they are stored. The
third V, variety, concerns the proliferation of data types from social and
mobile sources, machine-to-machine communication, and the traditional data
that are part of them. With the explosion of social networks, smart
devices, and sensors, data have become complex, because they include raw,
semistructured, and unstructured data from log files, web pages, search
indexes, cross media, emails, documents, forums, and so on. Variety
represents all these types of data, and enterprises must usually be able
to analyze all of them if they want to gain an advantage. Finally, the
last V, variability, refers to data unpredictability and to how data may
change over the years following the implementation of the architecture.
Moreover, the concept of variability also covers the variable
interpretations that may be assigned to the data and the confusion they
create in Big Data analysis, referring, for example, to the different
meanings that some data may have in Natural Language. These four
properties can be considered orthogonal aspects of data storage,
processing, and analysis; it is also interesting that increasing variety
and variability also increases the attractiveness of data and their
potential for providing hidden and unexpected information/meanings.
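As a back-of-the-envelope check of the volume figures above (a sketch based only on the numbers quoted in the text, not an independent estimate), the growth from about 800,000 petabytes in 2000 to the predicted 35 zettabytes in 2020 corresponds to a compound annual growth rate of roughly 21%:

```python
# Figures from the text: ~800,000 PB stored in 2000, ~35 ZB predicted for 2020.
petabyte = 10 ** 15
zettabyte = 10 ** 21

stored_2000 = 800_000 * petabyte      # i.e., 0.8 ZB
predicted_2020 = 35 * zettabyte
years = 2020 - 2000

growth_factor = predicted_2020 / stored_2000   # total growth over 20 years
cagr = growth_factor ** (1 / years) - 1        # compound annual growth rate

print(f"total growth: {growth_factor:.1f}x, CAGR: {cagr:.1%}")
# prints "total growth: 43.8x, CAGR: 20.8%"
```

A sustained growth rate of this order is what makes volume alone, before velocity and variety are even considered, a moving target for storage architectures.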
In science especially, new “infrastructures for global research data” that
can achieve interoperability, overcoming the limitations related to
language, methodology, and guidelines (policy), will be needed in a short
time. To cope with these types of complexity, several different techniques
and tools may be needed; they have to be composed, and new specific
algorithms and solutions may also have to be defined and implemented. The
wide range of problems and of specific needs makes it almost impossible to
identify unique architectures and solutions adaptable to all possible
application areas. Moreover, not only the number of application areas, so
different from each other, but also the different channels through which
data are collected daily increase the difficulty for companies and
developers of identifying the right way to achieve relevant results from
the accessible data. Therefore, this chapter can be a useful tool for
supporting researchers and technicians in making decisions about setting
up a Big Data infrastructure and solutions. To this end, it is very
helpful to have an overview of Big Data techniques; this chapter can be
used as a sort of guideline for better understanding the possible
differences and the most relevant features offered by the many products,
as the key aspects of Big Data solutions. These can be regarded as
requirements and needs according to which the different solutions can be
compared and assessed, in accordance with the case study and/or
application domain.
To this end, and to better understand the impact of Big Data science and
solutions, a number of examples describing major application domains that
take advantage of Big Data technologies and solutions are reported in the
following: education and training, cultural heritage, social media and
social networking, health care, research on the brain, finance and
business, marketing and social marketing, security, smart cities and
mobility, etc.
Big Data technologies have the potential to revolutionize education.
Educational data such as students’ performance, mechanics of learning, and
answers to different pedagogical strategies can provide an improved
understanding of students’ knowledge and accurate assessments of their
progress. These data can also help identify clusters of students with
similar learning styles or difficulties, thus defining a new form of
customized education based on sharing resources and supported by
computational models. The new models of teaching proposed in Woolf et al.
(2010) try to take into account student profile and performance, as well
as pedagogical, psychological, and learning mechanisms, to define
personalized instruction courses and activities that meet the different
needs of individual students and/or groups. In fact, in the educational
sector, the approach of collecting, mining, and analyzing large data sets
has been consolidated in order to provide new tools and information to the
key stakeholders. This data analysis can provide an increasing
understanding of students’ knowledge, improve the assessments of their
progress, and help focus questions in education and psychology, such as
the method of learning or how different students respond to different
pedagogical strategies. The collected data can also be used to define
models to understand what students actually know, to understand how to
enrich this knowledge, to assess which of the adopted techniques can be
effective in which cases, and finally to produce a case-by-case action
plan. In terms of Big Data, a large variety and variability of data is
present, to take into account all events in the students’ career; the data
volume is an additional factor. Another sector of interest in this field
is the e-learning domain, where two main kinds of users are defined: the
learners and the learning providers (Hanna, 2004). All personal details of
learners and the online learning providers’ information are stored in
specific databases, so applying data mining to e-learning can make it
possible to realize teaching programs targeted to particular interests and
needs through efficient decision making.
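The clustering of students with similar learning styles or difficulties mentioned above can be illustrated with a minimal k-means sketch over invented per-student features (hours online per week, average quiz score); all numbers here are hypothetical and not taken from any real e-learning data set:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over 2-D points: returns (centroids, labels)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # pick k initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

# Hypothetical per-student features: (hours online per week, avg quiz score).
students = [(2.0, 45), (3.0, 50), (2.5, 48),     # a low-engagement group
            (12.0, 88), (11.0, 92), (13.0, 90)]  # a high-engagement group
centroids, labels = kmeans(students, k=2)
```

A real deployment would use a dedicated library and far richer features, but the loop above is the whole idea: alternate nearest-centroid assignment and centroid re-estimation until the grouping stabilizes.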
For the management of large amounts of cultural heritage information,
Europeana has been created, with over 20 million content items indexed that
can be retrieved in real time. Earlier, each item was modeled with a simple
metadata model, ESE, while a new and more complete model called EDM
(Europeana Data Model), with a set of semantic relationships, is going to be
adopted in 2013 [Europeana]. A number of projects and activities are connected
to the Europeana network to aggregate content and tools. Among them,
ECLAP is a best practice network that collected not only content metadata for
Europeana, but also real content files from over 35 different institutions
having different metadata sets and over 500 file formats. A total of more than
1 million cross-media items is going to be collected, with an average of some
hundreds of metadata elements each, thus resulting in billions of information elements
and multiple relationships among them to be queried, navigated, and accessed
in real time by a large community of users [ECLAP] (Bellini et al., 2012a).
The volume of data generated by social networks is huge, with a high
variability in the data flow over time and space due to the human factor; for
example, Facebook receives 3 billion uploads per month, which corresponds
to approximately 3600 TB/year. Search engine companies such as Google
and Yahoo! collect trillions of bytes of data every day, around which real new
business is developed, offering useful services to its users and companies in
real time (Mislove et al., 2006). From these large amounts of data collected
through social networks (e.g., Facebook, Twitter, MySpace), social media and
Big Data solutions may estimate the user-collective profiles and behavior,
analyze product acceptance, evaluate the market trend, keep track of user
movements, extract unexpected correlations, evaluate the models of influ-
ence, and perform different kinds of predictions (Domingos, 2005). Social
media data can be exploited by considering geo-referenced information and
Natural Language Processing for analyzing and interpreting urban living:
massive folk movements, activities of the different communities in the city,
movements due to large public events, assessment of the city infrastructures,
etc. (Iaconesi and Persico, 2012). In a broader sense, from this information it
is possible to extract knowledge and data relationships, improving the
activity of query answering.
In the healthcare/medical field, a large amount of information about
patients’ medical histories, symptomatology, diagnoses, and responses to
treatments and therapies is collected. Data mining techniques might be
implemented to derive knowledge from this data in order to either iden-
tify new interesting patterns in infection control data or to examine report-
ing practices (Obenshain, 2004). Moreover, predictive models can be used
as detection tools exploiting the Electronic Patient Records (EPR) accumulated for
each person in the area, and taking into account the statistical data. Similar
solutions can be adopted as decision support for specific triage and diagnosis
or to produce effective plans for chronic disease management, enhancing the
quality of healthcare and lowering its cost. This activity may allow detecting
the inception of critical conditions for the observed people over the whole
population. In Mans et al. (2009), techniques for fast access and extraction of
information from event logs of medical processes have been investigated,
producing easily interpretable models by means of partitioning, clustering, and
preprocessing techniques. In the medical field, especially in hospitals, run-time
data are used to support the analysis of existing processes. Moreover,
taking into account genomic aspects and EPRs for millions of patients leads
to Big Data problems. For genome sequencing activities (HTS, high-throughput
sequencing) that produce several hundreds of millions of small
sequences, a new data structure for indexing called Gkarrays (Rivals et al.,
2012) has been proposed, with the aim of improving classical indexing systems
such as hash tables. The adoption of sparse hash tables is not enough to index
huge collections of k-mers (a k-mer is a subword of a given length k in a DNA
sequence, representing the minimum unit accessed). Therefore, a new data
structure has been proposed, based on three arrays: the first stores the start
position of each k-mer; the second, an inverted array, allows finding any k-mer
from a position in a read; and the last records the interval of positions of each
distinct k-mer, in sorted order. This structure allows obtaining, in constant
time, the number of reads that contain a given k-mer. A project of the University of
Salzburg with the regional hospital of Salzburg studies how to apply
machine learning techniques to the evaluation of large amounts of tomographic
images generated by computer (Zinterhof, 2012). The idea is to apply
proven techniques of machine learning for image segmentation, in the field
of computer tomography.
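The Gk-arrays idea described above can be illustrated with a simplified Python sketch (this is not the actual Gk-arrays implementation): pairs of (k-mer, read id) are sorted so that each distinct k-mer covers a contiguous interval, and the number of reads containing a k-mer is just the interval width. Note that the lookup below is O(log n) via binary search, whereas the real structure reaches constant time with additional arrays.

```python
from bisect import bisect_left

def build_kmer_index(reads, k):
    """Simplified, Gk-arrays-like index: sorted (k-mer, read id) pairs
    plus the interval boundaries of each distinct k-mer."""
    pairs = []
    for rid, read in enumerate(reads):
        seen = set()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # count each read at most once per k-mer
                seen.add(kmer)
                pairs.append((kmer, rid))
    pairs.sort()
    kmers, starts = [], []                # one entry per distinct k-mer
    for pos, (kmer, _) in enumerate(pairs):
        if not kmers or kmers[-1] != kmer:
            kmers.append(kmer)
            starts.append(pos)            # interval start in `pairs`
    starts.append(len(pairs))             # sentinel: end of last interval
    return kmers, starts, pairs

def reads_containing(index, kmer):
    """Number of reads containing `kmer`: binary search + interval width."""
    kmers, starts, _ = index
    i = bisect_left(kmers, kmer)
    if i == len(kmers) or kmers[i] != kmer:
        return 0
    return starts[i + 1] - starts[i]

idx = build_kmer_index(["ACGTAC", "CGTACG", "TTTTTT"], 3)
print(reads_containing(idx, "CGT"))   # 2: present in the first two reads
print(reads_containing(idx, "TTT"))   # 1: present only in the third read
```

The sorted-interval layout is what makes the per-k-mer read count a simple subtraction, at the price of an up-front sort.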
In several areas of science and research such as astronomy (automated sky
survey), sociology (web log analysis of behavioral data), and neuroscience
(genetic and neuroimaging data analysis), the aim of Big Data analysis
is to extract meaning from data and determine what actions to take. To cope
with the large amount of experimental data produced by research experi-
ments, the University of Montpellier started the ZENITH project [Zenith], which
adopts a hybrid p2p/cloud architecture (Valduriez and Pacitti, 2005). The
idea of Zenith is to exploit p2p to facilitate the collaborative nature of scien-
tific data, centralized control, and use the potentialities of computing, stor-
age, and network resources in the Cloud model, to manage and analyze this
large amount of data. The storage infrastructure used in De Witt et al. (2012)
is called CASTOR and allows for the management of metadata related to
scientific files of experiments at CERN. For example, the database of RAL
(Rutherford Appleton Laboratory) uses a single table (which reproduces the
hierarchical structure of the files) for storing 20 GB and runs about 500 transac-
tions per second on 6 clusters. With the increasing number of digital scientific
data, one of the most important challenges is digital preservation; for this
purpose, the SCAPE (SCAlable Preservation Environment) project is in progress
[SCAPE Project]. The platform provides an extensible infrastructure
to achieve the preservation of workflow information for large volumes of data.
The AzureBrain project (Antoniu et  al., 2010) aims to explore cloud com-
puting techniques for the analysis of data from genetic and neuroimaging
domains, both characterized by a large number of variables. The Projectome
project, connected with the Human Brain Project, HBP, aims to set up a high-
performance infrastructure for processing and visualizing neuroanatomical
information obtained by using confocal ultramicroscopy techniques (Silvestri
et al., 2012). The solution is connected with the modeling of knowledge and
information related to rat brains. Here, the single image scan of a mouse is
more than 1 Tbyte, and it is 1000 times smaller than a human brain.
The task of finding patterns in business data is not new; nowadays, it is gain-
ing greater relevance because enterprises are collecting and producing a
huge amount of data including massive contextual information, thus tak-
ing into account a larger number of variables. Using data to understand and
improve business operations, profitability, and growth is a great opportunity
and an evolving challenge. The continuous collection of large amounts of
data (business transaction, sales transaction, user behavior), widespread use
of networking technologies and computers, and design of Big Data warehouse
and data mart have created enormously valuable assets. An interesting pos-
sibility to extract meaningful information from these data could be the use
of machine learning techniques in the context of mining business data (Bose
and Mahapatra, 2001), or also to use an alternative approach of structured
data mining to model classes of customers in client databases using fuzzy
clustering and fuzzy decision making (Setnes et al., 2001). These data can be
analyzed in order to define prediction about the behavior of users, to identify
buying pattern of individual/group customers, and to provide new custom
services (Bose and Mahapatra, 2001). Moreover, in recent years, the major mar-
ket analysts conduct their business investigations with data that are not stored
within the classic RDBMS (Relational DataBase Management System), due to
the increase of various and new types of information. Analysis of web users’
behavior, customer loyalty programs, the technology of remote sensors, com-
ments into blogs, and opinions shared on the network are contributing to cre-
ate a new business model called social media marketing, and companies must
properly manage this information, with the corresponding potential for
new understanding, to maximize the business value of the data (Domingos,
2005). In the financial field, investment and business plans may be created
thanks to predictive models, derived using reasoning techniques and used
to discover meaningful and interesting patterns in business data.
Big Data technologies have been adopted to find solutions to logistic and
mobility management and optimization of multimodal transport networks in
the context of Smart Cities. A data-centric approach can also help enhance
the efficiency and the dependability of a transportation system. In fact,
through the analysis and visualization of detailed road network data and
the use of a predictive model, it is possible to achieve an intelligent trans-
portation environment. Furthermore, by merging high-fidelity stored
geographical data with scattered real-time sensor-network data, an efficient
urban planning system can be built that mixes public and private
transportation, offering people more flexible solutions. This new way
of traveling has interesting implications for energy and environment. The
analysis of the huge amount of data collected from the metropolitan mul-
timodal transportation infrastructure, augmented with data coming from
sensors, GPS positions, etc., can be used to facilitate the movements of people
via local public transportation solutions and private vehicles (Liu et al., 2009).
The idea is to provide intelligent real-time information to improve traveler
experience and operational efficiencies (see, for example, the solutions for the
cities of Amsterdam, Berlin, Copenhagen, and Ghent). In this way, in fact, it
is possible to use the Big Data both as historical and real-time data
for the application of machine learning algorithms aimed at traffic state
estimation/planning, and also to detect unpredicted phenomena in a suffi-
ciently accurate way to support near real-time decisions.
In the security field, Intelligence, Surveillance, and Reconnaissance (ISR) define
topics that are well suited for data-centric computational analyses. Using
analysis tools for video and image retrieval, it is possible to establish alerts for
activities and events of interest. Moreover, intelligence services can use these
data to detect and combine special patterns and trends, in order to recog-
nize threats and to assess the capabilities and vulnerabilities with the aim to
increase the security level of a nation (Bryant et al., 2010).
In the field of energy resources optimization and environmental monitoring,
the  data related to the consumption of electricity are very important. The
analysis of a set of load profiles and geo-referenced information, with appro-
priate data mining techniques (Figueireido et al., 2005), and the construction
of predictive models from those data, could define intelligent distribution
strategies in order to lower costs and improve the quality of life in this field.
Another possible solution is an approach that provides for the adoption of
a conceptual model for smart grid data management based on the main
features of a cloud computing platform, such as collection and real-time
management of distributed data, parallel processing for research and
interpretation of information, and multiple and ubiquitous access (Rusitschka et al., 2010).
In the above overview about some of the application domains for Big Data
technologies, it is evident that to cope with those problems several different
kinds of solutions and specific products have been developed. Moreover, the
complexity and the variability of the problems have been addressed with a
combination of different open source or proprietary solutions, since presently
there is not an ultimate solution to the Big Data problem that includes in an
integrated manner data gathering, mining, analysis, processing, accessing,
publication, and rendering. It would therefore be extremely useful to have a
“map” of the hot spots to be taken into account during the design process and
the creation of these architectures, helping the technical staff to orient them-
selves in the wide range of products accessible on the Internet and/or offered
by the market. To this aim, we have tried to identify the main features that
can characterize architectures for solving a Big Data problem, depending on
the source of data, on the type of processing required, and on the application
context in which it should operate.
This chapter is organized as follows. In the “Main Requirements and Features
of Big Data Solutions” section, the main requirements and features for Big
Data solutions are presented by taking into account infrastructural and
architectural aspects, data management, and data analytics aspects. Section
“Overview of Big Data Solutions” reports a brief overview of existing solu-
tions for Big Data and their main application fields. In the “Comparison and
Analysis of Architectural Features” section, a comparison and analysis of
the architectural features is presented. The analysis has made it possible to
highlight the most relevant features and differences among the different solu-
tions. Section “Application Domains Comparison” is characterized by the
description of the main application domains of the Big Data technologies and
includes our assessment of these applicative domains in terms of the identified
features reported in the “Overview of Big Data Solutions” section. Therefore,
this section can be very useful to identify which are the major challenges of
each domain and the most important aspects to be taken into account for each
domain. This analysis allowed us to perform some comparisons and consider-
ations about the most commonly adopted tools in the different domains. In
the same section, the identified application domains are crossed with the solu-
tions analyzed in “Overview of Big Data Solutions” section, thus providing a
shortcut to determine which products have already been applied to a specific
field of application, that is, a hint for the development of future applications.
Finally, in the “Conclusions” section, conclusions are drawn.

Main Requirements and Features of Big Data Solutions


In this section, according to the above-reported short overview of Big Data
problems, we have identified a small number of main aspects that should
be addressed by architectures for management of Big Data problems. These
aspects can be regarded as a collection of major requirements to cope with
most of the issues related to the Big Data problems. We have divided the
identified main aspects into three main categories which, respectively, concern
the infrastructure and the architecture of the systems that should cope
with Big Data; the management of the large amount of data and charac-
teristics related to the type of physical storage; and the accesses to data
and techniques of data analytics, such as ingestion, log analysis, and every-
thing else pertinent to the post-production phase of data processing. In some
cases, the features are provided and/or inherited by the operating system or
by the cloud/virtual infrastructure. Therefore, the specific Big Data solutions
and techniques have to be capable of taking advantage of the underlying
operating system and the infrastructure.

Infrastructural and Architectural Aspects


The typical Big Data solutions are deployed on cloud exploiting the flexibility
of the infrastructure. As a result, some of the features of Big Data solutions
may depend on the architecture and infrastructure facilities from which the
solution inherits/exploits the capabilities. Moreover, specific tools for data
gathering, processing, rendering, etc. may be capable or incapable of exploit-
ing a different range of cloud-based architectural aspects. For example, not
all databases can be distributed on multiple servers, not all algorithms can be
profitable remapped on a parallel architecture, not all data access or render-
ing solutions may exploit multilayered caches, etc. To this end, in the follow-
ing paragraphs, a set of main features is discussed, among them:
scalability, multitiered memory, availability, parallel and distributed process
management, workflow, self-healing, data security, and privacy. A sum-
mary map is reported in the “Overview of Big Data Solutions” section.

Scalability
This feature may impact several aspects of the Big Data solution (e.g.,
data storage, data processing, rendering, computation, connection, etc.) and
has to cope with the capability of maintaining acceptable performance when
scaling from small to large problems. In most cases, the scalability is obtained
by using distributed and/or parallel architectures, which may be allocated
on cloud. Both computing and storage resources can be located over a net-
work to create a distribute system where managing also the distribution of
workload.
As regards computational scalability, when processing a very huge data set
it is important to optimize the workload, for example, with a parallel archi-
tecture, as proposed in the Zenith project [Zenith], which may perform several
operations simultaneously (on an appropriate number of tasks), or by provid-
ing dynamic allocation of computation resources (i.e., a process releases
a resource as soon as it is no longer needed, so that it can be assigned to
another process), a technique used in the ConPaas platform (Pierre et al., 2011).
Usually, traditional computational algorithms are not scalable, and thus
specific restructurings of the algorithms have to be defined and adopted.
On the other hand, not all algorithms can take advantage of parallel
and/or distributed architectures for computing; specific algorithms have to
be defined, provided that an efficient parallel and/or distributed solution
exists. The evolution to distributed and parallel processing is just the first
step, since processes have to be allocated and managed on some parallel
architecture, which can be developed ad hoc or generally set up. Semantic
grids and parallel architectures can be applied to the problem (Bellini et al.,
2012b) [BlueGene].
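The kind of algorithm restructuring mentioned above can be sketched with a word count recast in map/reduce style (an illustrative example, not tied to any of the cited systems): the map step runs independently per data chunk and could therefore be dispatched to parallel or distributed workers, while the reduce step merges partial results.

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    """Map step: count words in one chunk; independent of all other chunks."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """Reduce step: merge the partial counts produced by the map step."""
    return reduce(lambda a, b: a + b, partials, Counter())

chunks = ["big data big", "data computing on big data"]
partials = [map_count(c) for c in chunks]   # each call could run on a worker
total = reduce_counts(partials)
print(total["big"], total["data"])          # 3 3
```

Because `map_count` shares no state between chunks, the list comprehension can be swapped for a parallel map (e.g., a process pool) without changing the result.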
Each system for Big Data may provide a scalable storage solution. In fact,
the main problem could be to understand to what extent a storage solu-
tion has to be scalable to satisfy the worst operative cases or the most com-
mon cases (and, in general, the most expensive cases). Moreover, for large
experiments, the data collection and processing may be not predictable with
high precision in the long term, for example, for the storage size and cost.
For example, it is not clear how much storage would be needed to collect
genomic information and EHR (Electronic Healthcare Records) for a unified
European health system in 5 or 10 years. In any case, because EHRs contain
a large amount of data, an interesting approach for their management could
be the use of a solution based on HBase, which provides a distributed,
fault-tolerant, and scalable database on clouds, built on top of HDFS, with
random real-time read/write access to big data, overcoming the design lim-
its of traditional RDBMSs (Yang et al., 2011). Furthermore, to focus on this
problem, a predictive model to understand how the need for storage space
will increase should be made, although the complexity and costs of such a
model are high. In most cases, it is preferable to take a pragmatic approach:
first guess and work with the present problems by using cheap hardware
and, if necessary, increase the storage on demand. This approach obviously
cannot be considered completely scalable; scalability is not just about storage
size, and there remains the need to associate the presented solution with a
system capable of scaling operationally (Snell, 2011).
A good solution to optimize the reaction time and to obtain a scalable solu-
tion at limited costs is the adoption of a multitiered storage system, including
cache levels, where data pass from one level to another along the hierar-
chy of storage media having different response times and costs. In fact, a
multitier approach to storage, utilizing arrays of disks for backup together with
primary storage and the adoption of efficient file systems, allows us to
both provide backups and restores to online storage in a timely manner, as
well as to scale up the storage when primary storage grows. Obviously, each
specific solution does not have to implement all layers of the memory hier-
archy, because the needs depend on the single specific case, together with
the amount of information to be accessed per second, the depth of the
cache memories, the binning of different data types into classes based on their
availability and recoverability, or the choice to use a middleware to connect
separate layers. The structure of the multitiered storage can be designed on
the basis of a compromise between access speed and overall storage cost. As
a counterpart, the multiple storage tiers create a large amount of maintenance costs.
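The tiered idea can be made concrete with a toy two-level store (a sketch for illustration only, with invented names): a small, fast LRU cache sits in front of a larger, slower tier; items are promoted on access and the least recently used item is evicted when the fast tier is full.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a fast, size-limited tier 1 (cache) in front of
    a large, slow tier 2, with LRU eviction on promotion."""
    def __init__(self, cache_size=2):
        self.cache = OrderedDict()   # tier 1: fast, limited capacity
        self.disk = {}               # tier 2: slow, "unlimited" capacity
        self.cache_size = cache_size

    def put(self, key, value):
        self.disk[key] = value       # writes always land in the capacity tier

    def get(self, key):
        if key in self.cache:                 # tier-1 hit
            self.cache.move_to_end(key)       # refresh LRU position
            return self.cache[key]
        value = self.disk[key]                # tier-1 miss: fetch from tier 2
        self.cache[key] = value               # promote to the fast tier
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict least recently used
        return value

store = TieredStore(cache_size=2)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(key, value)
store.get("a"); store.get("b"); store.get("c")
print(list(store.cache))  # ['b', 'c']: "a" was evicted from the fast tier
```

Real multitier systems add more levels (RAM, SSD, disk, tape) and asynchronous migration policies, but the promote/evict trade-off between access speed and capacity is the same.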
Scalability may take advantage of the recent cloud solutions that imple-
ment techniques for dynamically bursting storage and processes
from private to public clouds and among the latter. Private cloud computing
has recently gained much traction from both commercial and open-source
interests (Microsoft, 2012). For example, tools such as OpenStack [OpenStack
Project] can simplify the process of managing virtual machine resources. In
most cases, for small-to-medium enterprises, there is a trend to migrate mul-
titier applications into public cloud infrastructures (e.g., Amazon), which are
delegated to cope with scalability via elastic cloud solutions. A deep discus-
sion on cloud is out of the scope of this chapter.

High Availability
The high availability of a service (which may refer to a general service, to
storage, processes, or the network) is a key requirement in an architecture
that has to support the simultaneous use by a large number of users and/or
computational nodes located in different geographical locations (Cao et al., 2009).
Availability refers to the ability of the community of users to access a system
exploiting its services. High availability leads to increased difficulty in
guaranteeing data updates, preservation, and consistency in real time, and it is
fundamental that a user perceives, during his session, the actual and proper
reactivity of the system. To cope with these features, the design should be
fault-tolerant, as in redundant solutions for data and computational capabili-
ties to make them highly available despite the failure of some hardware and
software elements of the infrastructure. The availability of a system is usu-
ally expressed as a percentage of time (the nines method) that a system is up
over a given period of time, usually a year. In cloud systems, for instance,
the level of 5 nines (99.999% of the time means HA, high availability) is typically
related to the service at hardware level, and it indicates a downtime per year
of approximately 5 min; it is important to note, however, that this time does
not always have the same value, since it depends on the organization that
relies on the critical system. The present solutions obtain the HA score by
using a range of techniques of cloud architectures, such as fault-tolerant
capabilities for virtual machines, redundant storage for distributed databases,
load balancing for the front end, and the dynamic migration of virtual machines.
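The "nines method" maps directly to a downtime budget; a small computation (assuming a non-leap 365-day year) reproduces the roughly 5 minutes per year cited for five nines:

```python
def downtime_minutes_per_year(nines):
    """Yearly downtime allowed by an availability of `nines` nines."""
    unavailability = 10 ** (-nines)        # e.g., 5 nines -> 0.00001
    return 365 * 24 * 60 * unavailability  # 525,600 minutes in a year

for n in (3, 4, 5):
    print(n, "nines:", round(downtime_minutes_per_year(n), 2), "min/year")
# 3 nines allow ~525.6 min, 4 nines ~52.56 min, 5 nines ~5.26 min per year
```

This also shows why each extra nine is expensive: the allowed downtime shrinks by a factor of ten at every step.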

Computational Process Management


The computational activities on Big Data may take a long time and may be
distributed over multiple computers/nodes of some parallel
architecture, connected with some networking system. Therefore, one
of the main characteristics of most of the Big Data solutions has to cope with
the needs of controlling computational processes by means of: allocating them
on a distributed system, putting them in execution on demand or periodi-
cally, killing them, recovering processing from failures, reporting any
errors, scheduling them over time, etc. Sometimes, the infrastructure that
allows putting parallel computational processes into execution can work as a
service; thus, it has to be accessible to multiple users and/or other multitier
architectures and servers. This means that sophisticated solutions for parallel
processing and scheduling are needed, including the definition of Service
Level Agreements (SLAs), as in classical grid solutions. Examples of solu-
tions coping with these aspects are computational grids, media grids,
semantic computing, and distributed processing, such as the AXCP media grid
(Bellini et al., 2012c) and general grids (Foster et al., 2002). The solution for
parallel data processing has to be capable of dynamically exploiting the
computational power of the underlying infrastructure, since most of the
Big Data problems may be computationally intensive for limited time slots.
Cloud solutions may help one to cope with the concepts of elastic cloud for
implementing dynamic computational solutions.
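A minimal sketch of the process-management needs listed above (allocation to a pool of workers, recovery from failure by retry, reporting of eventual errors), using a thread pool; the function names and the simple retry policy are illustrative, not taken from any of the cited systems.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_tasks(tasks, workers=4, retries=2):
    """Allocate tasks to a worker pool, retry failed ones, report errors.

    `tasks` is a list of (name, function, args) tuples.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(fn, *args): (name, fn, args, retries)
                   for name, fn, args in tasks}
        while pending:
            for fut in as_completed(list(pending)):
                name, fn, args, left = pending.pop(fut)
                try:
                    results[name] = fut.result()
                except Exception as exc:
                    if left > 0:   # recover from failure by rescheduling
                        pending[pool.submit(fn, *args)] = (name, fn, args, left - 1)
                    else:          # give up and report the eventual error
                        errors[name] = exc
    return results, errors

results, errors = run_tasks([("square", lambda x: x * x, (7,))])
print(results)  # {'square': 49}
```

A production system would add scheduling over time, SLA enforcement, and distribution across machines; the allocate/monitor/recover loop, however, stays the same in shape.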

Workflow Automation
Big Data processes are typically formalized in the form of process work-
flows, from data acquisition to results production. In some cases, the work-
flow is programmed by using a simple XML (Extensible Markup Language)
formalization or effective programming languages, for example, Java,
JavaScript, etc. Related data may strongly vary in terms of dimensions and
data flow (i.e., variability): an architecture that handles both lim-
ited and large volumes of data well must be able to fully support the creation,
organization, and transfer of these workflows, in single-cast or broadcast mode. To
implement this type of architectures, sophisticated automation systems are
used. These systems work on different layers of the architecture through
applications, APIs (Application Program Interface), visual process design
environment, etc. Traditional Workflow Management Systems (WfMS) may
not be suitable for processing a huge amount of data in real time, formal-
izing the stream processing, etc. In some Big Data applications, the high
data flow and timing requirements (soft real time) have made the tradi-
tional “store-then-process” paradigm inadequate, so that complex event pro-
cessing (CEP) paradigms have been proposed (Gulisano et al., 2012): a system
that processes a continuous stream of data (events) on the fly, without any stor-
age. In fact, CEP can be regarded as an event-driven architecture (EDA),
dealing with the detection of and reaction to events, which specifically
has the task of filtering, matching, and aggregating low-level events into
high-level events. Furthermore, by creating a parallel-distributed CEP, where
data are partitioned across processing nodes, it is possible to realize an
elastic system capable of adapting the processing resources to the actual
workload reaching the high performance of parallel solutions and over-
coming the limits of scalability.
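The filter/match/aggregate chain of a CEP engine can be sketched on a toy event stream (a hypothetical brute-force-login detector; the event names and thresholds are invented for illustration). Note that only a bounded window of recent events is kept, never the whole stream:

```python
from collections import deque

def cep(events, threshold=3, window=10):
    """Process a stream of low-level events on the fly (no storage),
    aggregating them into high-level events."""
    recent = deque(maxlen=window)           # bounded working state, not a database
    for ev in events:                       # events arrive continuously
        if ev["type"] != "login_failed":    # filtering of low-level events
            continue
        recent.append(ev)
        # matching/aggregation: repeated failures by the same user
        fails = sum(1 for e in recent if e["user"] == ev["user"])
        if fails >= threshold:
            yield {"type": "brute_force_suspected", "user": ev["user"]}
            recent.clear()

stream = [{"type": "login_failed", "user": "bob"}] * 3
print(list(cep(iter(stream))))  # one "brute_force_suspected" event for "bob"
```

In a parallel-distributed CEP, the stream would be partitioned (here, e.g., by user) across nodes each running this same loop, which is what makes the approach elastic.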
An interesting application example is the Large Hadron Collider (LHC),
the most powerful particle accelerator in the world, which is estimated to pro-
duce 15 million gigabytes of data every year [LHC], then made available to
physicists around the world thanks to the supporting infrastructure, the “World-
wide LHC Computing Grid” (WLCG). The WLCG connects more than 140 computing
centers in 34 countries with the main objective to support the collection and
storage of data and processing tools, simulation, and visualization. The idea
behind the operation requires that the LHC experimental data are recorded
on tape at CERN before being distributed to 11 large computer centers (cen-
ters called “Tier 1”) in Canada, France, Germany, Italy, the Netherlands,
Scandinavia, Spain, Taiwan, the UK, and the USA. From these sites, the data
are made available to more than 120 “Tier-2” centers, where specific analyses
can be conducted. Individual researchers can then access the information
using computer clusters or even their own personal computer.

Cloud Computing
The cloud capability allows one to obtain seemingly unlimited storage space
and computing power, which is the reason why the cloud paradigm is
considered a very desirable feature in each Big Data solution (Bryant et al.,
2008). It is a new business where companies and users can rent, by using
the “as a service” paradigm, infrastructure, software, products, processes,
etc.: Amazon [Amazon AWS], Microsoft [Microsoft Azure], Google [Google
Drive]. Unfortunately, these public systems are not enough for extensive
computations on large volumes of data, due to the low bandwidth; ideally, a
cloud computing system for Big Data should be geographically dispersed, in
order to reduce its vulnerability in the case of natural disasters, but should
also have a high level of interoperability and data mobility. In fact, there are
systems that are moving in this direction, such as the OpenCirrus project
[Opencirrus Project], an international test bed that allows experiments on
interlinked cluster systems.
Self-Healing
This feature refers to the capability of a system to autonomously solve the fail-
ure problems, for example, in the computational process, in the database and
storage, and in the architecture. For example, when a server or a node fails, it
is important to have the capability of automatically solving the problem to avoid
repercussions on the entire architecture. Thus, an automated recovery-from-
failure solution, which may be implemented by means of fault-tolerant solutions,
balancing, hot spares, etc., and some intelligence are needed. Therefore, it is
an important feature for Big Data architectures, which should be capable of
autonomously bypassing the problem. Then, once informed about the prob-
lems and the performed action to solve it, the administrator may perform an
intervention. This is possible, for example, through techniques that automati-
cally redirect to other resources the work that was planned to be carried out
by the failed machine, which has to be automatically put offline. To this end, there
are commercial products which allow setting up distributed and balanced
architecture where data are replicated and stored in clusters geographically
dispersed, and when a node/storage fails, the cluster can self-heal by recreat-
ing the missing data of the damaged node in its free space, thus reconstruct-
ing the full capability of recovering from the next problem. Otherwise, as a
result of the breakdown, capacity may decrease and the system runs in degraded
conditions until the storage, processor, or resource is replaced (Ghosh et al., 2007).

Data Management Aspects


In the context of data management, a number of aspects characterize Big Data solutions, among them: the maximum size of the database, the data models, the capability of setting up distributed and clustered data management solutions, the sustainable rate of the data flow, the capability of partitioning the data storage to make it more robust and increase performance, the query model adopted, the structure of the database (relational, RDF (Resource Description Framework), reticular, etc.), etc. Considering data structures for Big Data, there is a trend toward the so-called NoSQL ("Not Only SQL") databases, even if there are good solutions that still use relational databases (Dykstra, 2012). On the market and among open-source solutions, there are several different types of NoSQL databases and rational reasons to use them in different situations, for different kinds of data. There are many methods and techniques for dealing with Big Data, and in order to identify the best choice in each case, a number of aspects have to be taken into account in terms of architecture and hardware solutions, because different choices can also greatly affect the performance of the overall system to be built. Related to database performance and data size, the so-called CAP theorem plays a relevant role (Brewer, 2001, 2012). The CAP theorem states that any distributed storage system for sharing data can provide only two of the three main
Tassonomy and Review of Big Data Solutions Navigation 71

features: consistency, availability, and partition tolerance (Fox and Brewer, 1999). Consistency means that, after an operation, the data model is still in a consistent state and provides the same data to all its clients. Availability means that the solution is robust with respect to internal failures, that is, the service remains available. Partition tolerance means that the system continues to provide service even when it is divided into disconnected subsets, for example, when a part of the storage cannot be reached. To cope with the CAP theorem, Big Data solutions try to find a trade-off between continuing to provide the service despite partitioning problems and, at the same time, reducing inconsistencies, thus supporting the so-called eventual consistency.
Furthermore, in the context of relational databases, the ACID (Atomicity, Consistency, Isolation, and Durability) properties describe the reliability of database transactions. This paradigm does not apply to NoSQL databases where, in contrast to the ACID definition, the data state provides the so-called BASE properties: Basically Available, Soft state, and Eventually consistent. Therefore, it is typically hard to guarantee fault tolerance in a BASE architecture for Big Data management since, as Brewer's CAP theorem says, there is no choice but to make a compromise if you want to scale up. In the following, some of the above aspects are discussed and explained in more detail.
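The CAP/BASE trade-off can be made concrete with a toy quorum-replication sketch: with N replicas, W write acknowledgements, and R read probes, choosing R + W > N guarantees that every read overlaps the latest write, while smaller quorums favor availability at the price of stale, eventually consistent reads. Everything below (the class name, version counters, replica lists) is an invented illustration, not the API of any product cited in this chapter.

```python
class QuorumStore:
    """Toy replicated key-value store with quorum reads/writes.
    R + W > N yields read-your-writes consistency; smaller quorums
    favor availability but may return stale values."""

    def __init__(self, n=3, w=2, r=2):
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]  # replica: key -> (version, value)

    def put(self, key, value):
        # Write to the first W replicas; the rest stay stale until anti-entropy runs.
        version = max(rep.get(key, (0, None))[0] for rep in self.replicas) + 1
        for rep in self.replicas[: self.w]:
            rep[key] = (version, value)

    def get(self, key):
        # Probe R replicas and return the value with the highest version seen.
        answers = [rep.get(key, (0, None)) for rep in self.replicas[-self.r :]]
        return max(answers)[1]

store = QuorumStore(n=3, w=2, r=2)
store.put("x", "old")
store.put("x", "new")
print(store.get("x"))  # with R + W > N, the overlap replica always holds "new"
```

With `w=1, r=1` (so R + W <= N), a read may probe only replicas that never saw the write and return a stale answer, which is exactly the availability-over-consistency end of the trade-off.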

Database Size
In Big Data problems, the database size may easily reach magnitudes of hundreds of terabytes (TB), petabytes (PB), or exabytes (EB). The evolution of Big Data solutions has seen an increase in the amount of data that can be managed. In order to exploit these huge volumes of data and to improve scientific productivity, new technologies and techniques are needed. The real challenges of database size are related to indexing and to access to the data. These aspects are treated in the following.

Data Model
To cope with huge data sets, a number of different data models are available, such as the Relational Model, Object DB, XML DB, or the Multidimensional Array model that extends database functionality as described in Baumann et al. (1998). Systems like Db4o (Norrie et al., 2008) or RDF-3X (Schramm, 2012) propose different solutions for data storage that can handle more or less structured information and the relationships among its elements. The data model represents the main factor that influences the performance of data management. In fact, the performance of indexing represents in most cases the bottleneck of the elaboration. Alternatives may be solutions that belong to the so-called category of NoSQL databases, such as ArrayDBMS (Cattel, 2010), MongoDB [mongoDB], CouchDB [Couchbase], and HBase [Apache HBase], which provide higher speeds with respect to traditional RDBMS (relational database
72 Big Data Computing

management systems). Within the broad category of NoSQL databases, large NoSQL families can be identified, which differ from each other in storage and indexing strategy:

• Key-value stores: highly scalable solutions that achieve good speed on large lists of elements, such as stock quotes; examples are Amazon Dynamo [Amazon Dynamo] and Oracle Berkeley DB [Oracle Berkeley].
• Wide column stores (big tables): databases in which the columns are grouped and in which keys and values can be composite (such as HBase [Apache HBase] and Cassandra [Apache Cassandra]). They are very effective for time series and for data coming from multiple sources (sensors, devices, and websites) that need high speed. Consequently, they provide good performance in read and write operations, while they are less suitable for data sets in which the relationships are as important as the data themselves.
• Document stores: aligned with object-oriented programming, from clustering to data access, and with the same behavior as key-value stores, where the value is the document content. They are useful when the data are hard to represent with a relational model due to their high complexity; they are therefore used for medical records or to cope with data coming from social networks. Examples are MongoDB [mongoDB] and CouchDB [Couchbase].
• Graph databases: suitable to model relationships among data. The access model is typically transactional and therefore suitable for applications that need transactions. They are used in fields such as geospatial applications, bioinformatics, network analysis, and recommendation engines. The execution of traditional SQL queries is not simple. Examples are Neo4J [Neo4j], GraphBase [GraphBase], and AllegroGraph [AllegroGraph].

Other NoSQL database categories are object databases, XML databases, multivalue databases, multimodel databases, multidimensional databases, etc. [NoSQL DB].
It is therefore important to choose the right NoSQL storage type during the design phase of the architecture to be implemented, considering the different features that characterize the various databases. In other words, it is very important to use the right tool for each specific project, because each storage type has its own weaknesses and strengths.
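As a toy illustration of why the access pattern should drive the choice among these families, the sketch below contrasts a key-value lookup (one key, one hop) with the multi-hop relationship traversal a graph database is designed for; the data and function names are invented for the example.

```python
from collections import deque

# Key-value style: a flat lookup answers "what is the quote for AAPL?" in one hop.
quotes = {"AAPL": 187.4, "MSFT": 402.1}

# Graph style: relationships are first-class, so "friends of friends" is a traversal.
follows = {"ann": ["bob", "cori"], "bob": ["dana"],
           "cori": ["dana", "eve"], "dana": [], "eve": []}

def within_hops(graph, start, hops):
    """Breadth-first traversal up to `hops` edges away (excluding the start node)."""
    seen, frontier = {start}, deque([(start, 0)])
    reached = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neigh in graph.get(node, []):
            if neigh not in seen:
                seen.add(neigh)
                reached.add(neigh)
                frontier.append((neigh, depth + 1))
    return reached

print(quotes["AAPL"])                          # key-value: single key access
print(sorted(within_hops(follows, "ann", 2)))  # ['bob', 'cori', 'dana', 'eve']
```

In a relational store the traversal would need one self-join per hop; a graph store makes the neighbor walk the primitive operation, which is the point of the comparison above.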

Resources
The main performance bottlenecks for NoSQL data stores correspond to the main computer resources: network, disk and memory performance, and the computational capabilities of the associated CPUs. Typically, Big Data stores are based on clustering solutions in which the whole data set is partitioned among clusters comprising a number of nodes (the cluster size). The number of nodes in each cluster affects the completion time of each job, because a greater number of nodes in a cluster corresponds to a lower completion time of the job. In this sense, the memory size and the computational capabilities of each node also influence node performance (De Witt et al., 2008). Most of the NoSQL databases use persistent socket connections, while the disk is always the slowest component because of the inherent latency of non-volatile storage. Thus, any high-performance database needs some form of memory caching or memory-based storage to optimize memory performance. Another key point is the memory size and usage of the selected solution. Some solutions, such as HBase [Apache HBase], are considered memory-intensive, and in these cases a sufficient amount of memory on each server/node has to be guaranteed to cover the needs of the part of the cluster located in its region of interest. When the amount of memory is insufficient, the overall performance of the system drastically decreases (Jacobs, 2009). Network capability is an important factor that affects the final performance of the entire Big Data management solution. In fact, network connections among clusters are used extensively during read and write operations, and there are also algorithms, such as Map-Reduce, whose shuffle step makes heavy use of the network. It is therefore important to have a highly available and resilient network that also provides the necessary redundancy and scales well, that is, allows the number of clusters to grow.
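The shuffle step mentioned above is easiest to see in a single-process sketch of Map-Reduce word counting, a standard textbook reduction rather than any vendor's API; in a real cluster the shuffle phase is the network-heavy part, since every intermediate (key, value) pair may have to travel to the node responsible for that key.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from every input split.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key. In a cluster this is the network-heavy
    # step, since each pair may be sent to the node owning that key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently (hence in parallel).
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data computing", "big data solutions"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'computing': 1, 'solutions': 1}
```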

Data Organization
Data organization impacts the storage, access, and indexing performance of data (Jagadish et al., 1997). In most cases, a great part of the accumulated data is not relevant for estimating results, and thus it could be filtered out and/or stored in compressed form, as well as moved into slower memory along the multitier architecture. To this end, a challenge is to define rules for arranging and filtering data in order to avoid/reduce the loss of useful information while preserving performance and saving costs (Olston et al., 2003). The distribution of data among different remote tables may cause inconsistencies when the connection is lost and the storage is partitioned by some fault. In general, it is not always possible to ensure that data are locally available on the node that processes them. It is evident that, if this condition is generally achieved, the best performance is obtained. Otherwise, the missing data blocks have to be retrieved, transferred, and processed in order to produce the results, with a high consumption of resources on the node that requested them and on the node that owns them, and thus on the entire network; therefore, the completion time is significantly higher.
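The filter-then-compress policy described above can be sketched with the standard library alone; the record layout and the relevance rule below are invented for the example.

```python
import json
import zlib

# Invented data set: 1000 records, of which only 1 in 10 matters for results.
records = [{"id": i, "value": i % 7, "relevant": i % 10 == 0} for i in range(1000)]

# Filter: keep only the records that contribute to the results (the hot tier).
kept = [r for r in records if r["relevant"]]

# Compress: move the rest to a cheaper, slower tier in compressed form.
cold = zlib.compress(json.dumps([r for r in records if not r["relevant"]]).encode())

raw_size = len(json.dumps(records).encode())
print(len(kept), raw_size, len(cold))  # far fewer hot records; cold tier shrinks

# No useful information is lost: decompression restores the filtered-out rows.
restored = json.loads(zlib.decompress(cold))
```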

Data Access for Rendering


The activity of data rendering is related to accessing data in order to present them to the users, in some cases after performing some prerendering processing. The presentation of original or produced data results may be a relevant challenge when the data size is so huge that producing a representation can be highly computation-intensive, and most of the single data items would not be relevant for the final presentation to the user. For example, representing at a glance the distribution of 1 billion economic transactions on a single image would in any case be limited to some thousands of points; the presentation of the distribution of people flows in a large city would be based on the analysis of several hundreds of millions of movements, while their representation would be limited to a map rendered as an image of a few megabytes. A query on a huge data set may produce an enormous set of results. Therefore, it is important to know their size in advance and to be capable of analyzing Big Data results with scalable display tools that can produce a clear vision in a range of cases, from small to huge sets of results. For example, the node-link representation of an RDF graph does not provide a clear view of the overall RDF structure: one possible solution to this problem is the use of a 3D adjacency matrix as an alternative visualization method for RDF (Gallego et al., 2011). Thanks to some graph display tools, it is possible to highlight specific data aspects. Furthermore, it should be possible to guarantee efficient access, perhaps with the definition of standard interfaces, especially in business and medical applications with multichannel and multidevice delivery of results, without decreasing data availability. An additional interesting feature for data access is saving the user's experience in data access and navigation (the parameters and steps for accessing and filtering the data). The adoption of semantic queries in RDF databases is essential for many applications that need to produce heterogeneous results, and thus in those cases data rendering is very important for presenting the results and their relationships. Other solutions for data access are based on the production of specific indexes, for example, with Solr [Apache Solr] or in NoSQL databases. An example is the production of faceted results, in which the query results are divided into multiple categories on which the user can further restrict the search, by composing different facets/filters with "and"/"or". This important feature is present in solutions such as the RDF-HDT Library and the eXist project (Meier, 2003).
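The faceted-result mechanism just described can be sketched as follows: filters combine with "and" across facets and "or" within a facet, and for each facet the system reports how many results each value would leave. The item fields and facet names below are invented.

```python
from collections import Counter

items = [
    {"type": "book", "lang": "en"}, {"type": "book", "lang": "it"},
    {"type": "film", "lang": "en"}, {"type": "film", "lang": "en"},
]

def matches(item, filters):
    # "and" across facets; "or" within a facet's set of selected values.
    return all(item[facet] in values for facet, values in filters.items())

def facet_counts(items, filters, facet):
    # For each value of `facet`: how many results remain under the other filters.
    rest = {f: v for f, v in filters.items() if f != facet}
    return Counter(item[facet] for item in items if matches(item, rest))

filters = {"lang": {"en"}}
results = [i for i in items if matches(i, filters)]
print(len(results))                          # 3
print(facet_counts(items, filters, "type"))  # Counter({'film': 2, 'book': 1})
```

The counts next to each facet value are what lets the user narrow a huge result set step by step without ever issuing a query that returns nothing.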

Data Security and Privacy


The problem of data security is very relevant in the case of Big Data solutions. The data to be processed may contain sensitive information such as EPR (electronic patient records), bank data, general personal information such as profiles, and content under IPR (intellectual property rights) and thus under some licensing model. Therefore, sensitive data cannot be transmitted, stored, or processed in the clear, and thus have to be managed in some protected coded format, for example, with some encryption. Solutions based on conditional access, channel protection, and authentication may still have sensitive data stored in the clear in the storage. They are called Conditional Access Systems (CAS) and are used to manage and control user access to services and data (normal users, administrators, etc.) without protecting each single data element via encryption. Most Big Data installations are based on web service models, with few facilities for countering web threats, whereas it is essential that data are protected from theft and unauthorized access. Moreover, most of the present Big Data solutions offer only conditional access methods based on credentials for accessing the data, and do not protect the data with encrypted packages. On the other hand, content protection technologies are sometimes supported by Digital Rights Management (DRM) solutions, which allow defining and executing licenses that formalize the rights that can be exploited on a given content element, who can exploit those rights, and under which conditions (e.g., time, location, number of times, etc.). The control of user access rights is per se a Big Data problem (Bellini et al., 2013). DRM solutions use authorization, authentication, and encryption technologies to manage and enable the exploitation of rights by different types of users, with logical control of each user with respect to each single piece of the huge quantities of data. The same technology can be used to contribute to safeguarding data privacy, keeping the data encrypted until they are effectively used by authorized and authenticated tools and users. Therefore, access to data outside the permitted rights would be forbidden. Data security is a key aspect of an architecture for the management of such big quantities of data and is an excellent means to define who can access what. This is a fundamental feature in some areas such as health/medicine, banking, media distribution, and e-commerce. In order to enforce data protection, some frameworks are available to implement DRM and/or CAS solutions exploiting different encryption and technical protection techniques (e.g., MPEG-21 [MPEG-21], AXMEDIS (Bellini et al., 2007), ODRL (Iannella, 2002)). In the specific case of EPR, several millions of patients, each with hundreds of data elements, have to be managed; for each of them some tens of rights have to be controlled, thus resulting in billions of accesses and thus of authentications per day.
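A minimal sketch of the conditional-access idea discussed above: a grant names a user, a right, and a usage limit, and every access is checked against the grants, with deny by default. Real CAS/DRM frameworks such as MPEG-21 or ODRL express far richer conditions (time, location, etc.); the class and field names here are invented.

```python
from dataclasses import dataclass

@dataclass
class License:
    user: str
    right: str      # e.g. "read", "print"
    max_uses: int   # invented condition: how many times the right may be used
    used: int = 0

    def permits(self, user, right):
        return self.user == user and self.right == right and self.used < self.max_uses

class ConditionalAccess:
    def __init__(self):
        self.grants = []

    def grant(self, lic):
        self.grants.append(lic)

    def access(self, user, right):
        # Deny by default: only an explicit, unexhausted grant opens the data.
        for lic in self.grants:
            if lic.permits(user, right):
                lic.used += 1
                return True
        return False

cas = ConditionalAccess()
cas.grant(License(user="alice", right="read", max_uses=2))
print(cas.access("alice", "read"))   # True
print(cas.access("alice", "print"))  # False: right not granted
print(cas.access("bob", "read"))     # False: user not granted
```

Note the scale problem stated in the text: with millions of users and tens of rights per data element, the grant table itself becomes a Big Data store and the linear scan above would have to be replaced by an indexed lookup.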

Data Analytics Aspects


Data analysis aspects have to do with a large range of different algorithms for data processing. The analysis and review of the different data analytic algorithms for Big Data processing is not in the focus of this chapter, which aims at analyzing the architectural differences and the most important features of Big Data solutions. On the other hand, the data analytic algorithms may range over data ingestion, crawling, verification, validation, mining, processing, transcoding, rendering, distribution, compression, etc., and also cover the estimation of relevant results such as the detection of unexpected correlations, the detection of patterns and trends (for example, of events), the estimation of collective intelligence, the estimation of the inception of new trends, the prediction of new events and trends, the analysis of crowdsourcing data for sentiment/affective computing with respect to market products or personalities, the identification of people and folk trajectories, the estimation of similarities for producing suggestions and recommendations, etc. In most of these cases, the data analytic algorithms have to take into account user profiles, content descriptors, contextual data, collective profiles, etc.
The major problems of Big Data are related to how their "meanings" are discovered; usually this research occurs through complex modeling and analytics processes: hypotheses are formulated; statistical, visual, and semantic models are implemented to validate them; and then new hypotheses are formulated again to draw deductions, find unexpected correlations, and produce optimizations. In several of these cases, the specific data analytic algorithms are based on statistical data analysis; semantic modeling, reasoning, and queries; traditional queries; stream and signal processing; optimization algorithms; pattern recognition; natural language processing; data clustering; similarity estimation; etc.
In the following, key aspects are discussed and explained in more detail.

Data Mining/Ingestion
Data mining and ingestion aspects are two key features in the field of Big Data solutions; in fact, in most cases there is a trade-off among the speed of data ingestion, the ability to answer queries quickly, and the quality of the data in terms of update, coherence, and consistency. This compromise impacts the design of the storage system (i.e., OLTP vs. OLAP, On-Line Transaction Processing vs. On-Line Analytical Processing), which has to be capable of storing and indexing the new data at the same rate at which they reach the system, also taking into account that a part of the received data may not be relevant for the production of the requested results. Moreover, some storage and file systems are optimized for reading and others for writing, while workloads generally involve a mix of both these operations. An interesting solution is GATE, a framework and graphical development environment for developing applications and engineering components for language processing tasks, especially for data mining and information extraction (Cunningham et al., 2002). Furthermore, the data mining process can be strengthened and completed by the usage of crawling techniques, now consolidated for the extraction of meaningful data from information-rich web pages, also including complex structures and tags. The processing of a large amount of data can be very expensive in terms of resources used and computation time. For these reasons, it may be helpful to use a distributed approach with crawlers (with additional functionality) that work as a distributed system under a central control unit, which manages the allocation of tasks among the active computers in the network (Thelwall, 2001).

Another important feature is the ability to obtain advanced faceted results from queries on the large volumes of available data: this type of query allows the user to access the information in the store along multiple explicit dimensions and after the application of multiple filters. This interaction paradigm is used in mining applications and allows analyzing and browsing data across multiple dimensions; faceted queries are especially useful in e-commerce websites (Ben-Yitzhak et al., 2008). In addition to the features already seen, it is important to take into account the ability to process data in real time: today, in fact, especially in business, we are in a phase of rapid transition; there is a need for faster reactions, to be able to detect patterns and trends in a short time, in order to reduce the response time to customer requests. This increases the need to evaluate information as soon as an event occurs, that is, the company must be able to answer questions on the fly according to real-time data.
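Answering questions on the fly usually means maintaining incremental state as events arrive, rather than re-scanning a batch; a common building block is a sliding time window, sketched below with invented names.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count events per key over the trailing `window` seconds, updated
    incrementally as each event arrives (no batch re-scan)."""

    def __init__(self, window):
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps within the window

    def observe(self, key, timestamp):
        q = self.events[key]
        q.append(timestamp)
        # Evict timestamps that fell out of the window.
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        return len(q)

counter = SlidingWindowCounter(window=60)
print(counter.observe("login", 0))   # 1
print(counter.observe("login", 30))  # 2
print(counter.observe("login", 90))  # 1: the event at t=0 expired
```

Each update costs amortized constant time, which is what makes answering "how many logins in the last minute?" feasible at ingestion speed.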

Data Access for Computing


The most important enabling technologies are related to data modeling and data indexing. Both these aspects should be focused on fast access to and retrieval of data in a suitable format, to guarantee high performance in the execution of the computational algorithms used for producing results. The type of indexing may influence the speed of data retrieval operations, at the cost of increased storage space. Couchbase [Couchbase] offers an incremental indexing system that allows efficient access to data at multiple points. Another interesting method is the use of HFiles (Aiyer et al., 2012) and the already mentioned Bloom filters (Borthakur et al., 2011). An HFile is an index-organized data file created periodically and stored on disk. However, in the Big Data context, there is often the need to manage irregular data, with a heterogeneous structure, that do not follow any predefined schema. For these reasons, it could be interesting to apply an alternative indexing technique suitable for semistructured or unstructured data, as proposed in McHugh et al. (1998). On the other hand, where data come from different sources, establishing relationships among data sets allows data integration and can lead to additional knowledge and deductions. Therefore, the modeling and management of data relationships may become more important than the data themselves, especially where relationships play a very strong role (social networks, customer management). This is the case of new data types for social media that are formalized as highly interrelated content, for which the management of multidimensional relationships in real time is needed. A possible solution is to store relationships in specific data structures that ensure good access and extraction capabilities in order to adequately support predictive analytics tools. In most cases, in order to guarantee the demanded performance in the rendering and production of data results, a set of precomputed partial results and/or indexes can be estimated. These precomputed partial results should be stored in high-speed caches, as temporary data. Some kinds of data analytic algorithms create enormous amounts of temporary data that must be opportunely managed to avoid memory problems and to save time for the successive computations. In other cases, in order to compute statistics on the information that is accessed more frequently, it is possible to use techniques that create well-defined cache systems or temporary files to optimize the computational process. With the same aim, some incremental and/or hierarchical algorithms are adopted in combination with the above-mentioned techniques, for example, hierarchical clustering with k-means and k-medoid for recommendations (Everitt et al., 2001; Xui and Wunsch, 2009; Bellini et al., 2012c).
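The Bloom filters cited above let a store answer "is this key possibly in this file?" from a small in-memory bit array, skipping disk reads for keys that are certainly absent; false positives are possible, false negatives are not. The sketch below is the classic textbook construction, with illustrative hash and size choices.

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: k hash positions per key in an m-bit array.
    May return false positives, never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        # Derive k positions from salted SHA-256 digests (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for key in ("row:17", "row:42"):
    bf.add(key)
print(bf.might_contain("row:42"))   # True
print(bf.might_contain("row:999"))  # almost certainly False at this fill level
```

In an HFile-style layout, one such filter per file lets a read skip every file whose filter answers "no", so only candidate files are opened.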
A key element of Big Data access for data analysis is the presence of metadata as data descriptors, that is, additional information associated with the main data, which helps to recover data and understand their meaning in context. In the financial sector, for example, metadata are used to better understand customers, dates, and competitors, and to identify impactful market trends; it is therefore easy to understand that an architecture that allows the storage of metadata also represents a benefit for the subsequent data analysis operations. Structured metadata and organized information help to create a system with more easily identifiable and accessible information, and also facilitate the knowledge identification process through the analysis of the available data and metadata. A variety of attributes can be applied to the data, which may thus acquire greater relevance for users: for example, keywords, temporal and geospatial information, pricing, contact details, and anything else that improves the quality of the information that has been requested. In most cases, the production of suitable data descriptors can be a way to save time in recovering the real full data, since the matching and the further computational algorithms are based on those descriptors rather than on the original data. For example, the identification of duplicated documents can be performed by comparing the document descriptors, and the production of user recommendations can be performed on the basis of collective user descriptors or on the basis of the descriptors representing the centers of the clusters.
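The descriptor-based duplicate detection mentioned above can be sketched with a content hash as the descriptor: documents are grouped by a few bytes of digest instead of being compared pairwise in full. The normalization rule is an invented example.

```python
import hashlib
from collections import defaultdict

def descriptor(text):
    # Compact descriptor: a hash of the normalized content -- a few bytes
    # standing in for a document that may be megabytes long.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def find_duplicates(documents):
    # Group documents by descriptor; any group of size > 1 is a duplicate set.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        groups[descriptor(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

docs = {
    "a": "Big Data  computing",
    "b": "big data computing",      # same content after normalization
    "c": "something else entirely",
}
print(find_duplicates(docs))  # [['a', 'b']]
```

The same pattern, with similarity between descriptor vectors instead of exact hash equality, underlies the recommendation case cited in the text.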

Overview of Big Data Solutions


In this section, a selection of representative products for the implementation of different Big Data systems and architectures is analyzed and organized in a comparative table on the basis of the main features identified in the previous sections. To this end, the following paragraphs provide a brief overview of the considered solutions, as reported in Table 2.1 and described in the next section.
ArrayDBMSs extend database services with query support over a multidimensional array model, because Big Data queries often involve a high number of operations, each of which is applied to a large number of elements of the array. In these conditions, the execution time with a traditional database would be unacceptable. In the literature, and from real applications, a large number of examples are available that use various types of ArrayDBMS; among them, we can recall a solution based on the Rasdaman ArrayDBMS (Baumann et al., 1998). Differently from the other types, the Rasdaman ArrayDBMS provides support for domain-independent arrays of arbitrary size, and it uses a general-purpose declarative query language that is also associated with optimized internal execution, transfer, and storage. The conceptual model consists of arrays of any size, measure, and cell type, which are stored in tables named collections that contain an OID (object ID) column and an array column. The RaSQL language offers expressions in terms of multidimensional arrays of content objects. Following the standard Select-From-Where paradigm, the query process first gathers the inspected collections, then the "where" clause filters the arrays corresponding to the predicate, and finally the "select" prepares the matrices derived from the initial query. Internally, Rasdaman decomposes each array object into "tiles" that form the memory access units, the querying units, and the processing units. These parts are stored as BLOBs (Binary Large Objects) in a relational database. The formal algebra model for Rasdaman arrays offers a high potential for query optimization. In many cases where phenomena are sampled or simulated, the results are data that can be stored, searched, and submitted as arrays. Typically, the data arrays are described by metadata; for example, geographically referenced images may contain their position and the reference system in which they are expressed.
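Rasdaman's tiling strategy can be mimicked for a dense 2-D array: each tile becomes an independently storable unit (a BLOB in the relational backend), so a query touching a small region only needs to load the tiles it intersects. The tile size and the accessor function below are invented for the illustration.

```python
def tile_array(rows, cols, tile, get):
    """Split a rows x cols array into tile x tile chunks; `get(r, c)` reads a cell.
    Returns {(tile_row, tile_col): list-of-rows} -- each value is one storable unit."""
    tiles = {}
    for tr in range(0, rows, tile):
        for tc in range(0, cols, tile):
            tiles[(tr // tile, tc // tile)] = [
                [get(r, c) for c in range(tc, min(tc + tile, cols))]
                for r in range(tr, min(tr + tile, rows))
            ]
    return tiles

# A 4x4 array split into 2x2 tiles: a query on cells (0..1, 2..3) loads one tile.
tiles = tile_array(4, 4, 2, get=lambda r, c: r * 4 + c)
print(len(tiles))     # 4 tiles
print(tiles[(0, 1)])  # [[2, 3], [6, 7]]
```

Choosing the tile size is the storage-design trade-off: larger tiles mean fewer BLOB fetches per region query, smaller tiles mean less irrelevant data read per fetch.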

Couchbase
[Couchbase] is designed for real-time applications and does not support SQL queries. Its incremental indexing system is realized to be native to the JSON (JavaScript Object Notation) storage format. Thus, JavaScript code can be used to verify the document and to select which data are used as the index key. Couchbase Server is an elastic, open-source NoSQL database that automatically distributes data across commodity servers or virtual machines and can easily accommodate changing data management requirements, thanks to the absence of a schema to manage. Couchbase is also based on Memcached, which is responsible for the optimization of network protocols and hardware and allows obtaining good performance at the network level. Memcached [Memcached] is an open-source, distributed, main-memory-based caching system, which is especially used in highly trafficked websites and web applications with high performance demands. Moreover, thanks to Memcached, Couchbase can improve the online user experience, maintaining low latency and a good ability to scale up to a large number of users. Couchbase Server allows managing system updates in a simple way; updates can be performed without taking the entire system offline. It also allows

Table 2.1
Main Features of Reviewed Big Data Solutions

| Feature | ArrayDBMS | CouchBase | Db4o | eXist | Google MapReduce | Hadoop |
|---|---|---|---|---|---|---|
| Infrastructural and Architectural Aspects | | | | | | |
| Distributed | Y | Y | Y | A | Y | Y |
| High availability | A | Y | Y | Y | Y | Y |
| Process management | Computation-insensitive | Auto distribution of entire files | NA | So high for update | Configurable | Configurable |
| Cloud | A | Y | A | Y/A | Y | Y |
| Parallelism | Y | Y | Transactional | Y | Y | Y |
| Data Management Aspects | | | | | | |
| Data dimension | 100 PB | PB | 254 GB | 2^31 doc. | 10 PB | 10 PB |
| Traditional/not traditional | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL |
| SQL interoperability | Good | SQL-like language | A | A | A | Low |
| Data organization | Blob | Blob 20 MB | Blob | NA | Blob | Chunk |
| Data model | Multidim. array | 1 document/concept (document store) | Object DB + index tree | B-tree for XML-DB + index | Big table (CF, KV, OBJ, DOC) | Big table (column family) |
| Memory footprint | Reduces bidimensional | Documents + index | Objects + index | Documents + index | NA | NA |
| Users access type | Web interface | Multiple point | Remote user interface | Web interface, REST interface | Many types of interface | API, command line interface or HDFS-UI web app |
| Data access performance | Much higher if metadata are stored in DBMS | Speed up access to a document by automatic caching | Various techniques for optimal data access performance | NA | NA | NA |
| Data Analytics Aspects | | | | | | |
| Type of indexing | Multidimensional index | Y (incremental) | B-tree field indexes | B+-tree (XISS) | Distributed multilevel tree indexing | HDFS |
| Data relationships | Y | NA | Y | Y | Y | A |
| Visual rendering and visualization | Y (rView) | NA | Y | NA | P | A |
| Faceted query | NA | A | P | Y | A (Lucene) | A (Lucene) |
| Statistical analysis tools | Y | Y/A | A (optional library) | A (JMXClient) | A | Y |
| Log analysis | NA | Y | NA | NA | Y | Y |
| Semantic query | P | A (elastic search) | P | A | A | P (index for semantic search) |
| Indexing speed | More than RDBMS | Non-optimal performance | 5–10 times more than SQL | High speed with B+-tree | Y/A | A |
| Real-time processing | NA | Indexing + creating view on the fly | More suitable for real-time processing of events | NA | A | A (stream-based, HStreaming) |

Note: Y, supported; N, no info; P, partially supported; A, available but supported by means of a plug-in or external extension; NA, not available.
Table 2.1 (Continued)
Main Features of Reviewed Big Data Solutions

| Feature | HBase | Hive | MonetDB | MongoDB | Objectivity | OpenQM | RdfHdt Library | RDF 3X |
|---|---|---|---|---|---|---|---|---|
| Distributed | Y | Y | Y | Y | Y | NA | A | Y |
| High availability | Y | Y | P | Y | Y | Y | A | A |
| Process management | Write-intensive | Read-dominated (or rapidly change) | Read-dominated | More | Possibility | Divided among more processes | NA | Optimized |
| Cloud | Y | Y | A | Y | Y | NA | A | A |
| Parallelism | Y | Y | Y | Y | Y | NA | NA | Y |
| Data dimension | PB | PB | TB | PB | TB | 16 TB | 100 mil triples (TB) | 50 mil triples |
| Traditional/not traditional | NoSQL | NoSQL | SQL | NoSQL | XML/SQL++ | NoSQL | NoSQL | NoSQL |
| SQL interoperability | A | Y | Y | JOIN not supported | Y (SQL++) | NA | Y | Y |
| Data organization | A | Bucket | Blob | Chunk 16 MB | Blob | Chunk 32 KB | NA | Blob |
| Data model | Big table (column family) | Table/partitions/bucket (column family) | BAT (Ext SQL) | 1 table for each collection (document DB) | Classic table in which define models (graph DB) | 1 file/table (data + dictionary) (multivalue DB) | 3 structures, RDF graph for header (RDF store) | 1 table + permutations (RDF store) |
| Memory footprint | Optimized use | NA | Efficient use | Document + metadata | Like RDBMS | No compression | −50% data set | Less than data set |
| Users access type | Jython or Scala interface, REST or Thrift gateway | HiveQL queries | Full SQL interfaces | Command line, web interface | Multiple access from different query applications | Console or web application, AMS | Access on demand | Web interface (SPARQL) |
| Data access performance | NA | Accelerate queries with bitmap indices | Fast data access (MonetDB/XQuery is among the fastest and most scalable) | Over 50 GB, 10 times faster than MySQL | Does not provide any optimization for accessing replicas | NA | NA | NA |
| Type of indexing | H-files | Bitmap indexing | Hash index | Index RDBMS-like | Y (function and Objectivity/SQL++ interfaces) | B-tree based | RDF graph | Y (efficient triple indexes) |
| Data relationships | Y | NA | Y | Y | Y | Y | Y | Y |
| Visual rendering and visualization | NA | NA | NA | A | A (PerSay) | Y | Y | P |
| Faceted query | A (filter) | NA | NA | NA | Y (Objectivity PQE) | NA | Y | NA |
| Statistical analysis tools | Y | Y | A (SciQL) | Y (network traffic) | Y | A | A | Y |
| Log analysis | P | Y | NA | A | Y | NA | Y | Y |
| Semantic query | Y | NA | NA | NA | NA | NA | Y | Y |
| Indexing speed | Y | NA | More than RDBMS | High speed if DB dimension does not exceed memory | High speed | Increased speed with alternate key | 15 times faster than RDF | Aerodynamic |
| Real-time processing | A (HBaseHUT library) | A (Flume + Hive indexes our data and can be queried in real time) | NA | Y | Y (release 8) | NA | NA | NA |
82 Big Data Computing

realizing a reliable and highly available storage architecture, thanks to the multiple copies of the data stored within each cluster.
db4o is an object-based database (Norrie et al., 2008) that makes application objects persistent. It supports various forms of querying over these objects, such as query expression trees and iterator query methods, as well as query-by-example mechanisms to retrieve objects. Its advantages are simplicity, speed, and a small memory footprint.

eXist
eXist (Meier, 2003) is grounded on an open-source project to develop a native XML database system that can be integrated into a variety of possible applications and scenarios, ranging from web-based applications to documentation systems. The eXist database is completely written in Java and may be deployed in different ways: running inside a servlet engine, as a stand-alone server process, or directly embedded into an application. eXist provides schema-less storage of XML documents in hierarchical collections. It is possible to query a distinct part of the collection hierarchy, using an extended XPath syntax, or the documents contained in the database. eXist's query engine implements efficient, index-based query processing: a large range of queries are processed using index information, based on path join algorithms. This database is a viable solution for applications that deal with both large and small collections of XML documents and frequent updates of them. eXist also provides a set of extensions that allow searching by keyword, by proximity to the search terms, and by regular expressions.
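This style of path-based querying over a schema-less collection can be sketched in a few lines, using Python's standard-library ElementTree as a stand-in for eXist's XPath engine (an illustration of the idea only, not eXist's actual API):

```python
import xml.etree.ElementTree as ET

# Two small documents in a "collection": storage is schema-less, and the same
# path expression is evaluated over every document, in the spirit of eXist.
collection = [
    ET.fromstring("<doc><title>Big Data</title><tag>nosql</tag></doc>"),
    ET.fromstring("<doc><title>XML Databases</title><tag>xml</tag></doc>"),
]

def xpath_query(docs, path):
    # Evaluate one XPath expression over each document and gather the hits.
    hits = []
    for doc in docs:
        hits.extend(elem.text for elem in doc.findall(path))
    return hits

titles = xpath_query(collection, "./title")
# titles == ["Big Data", "XML Databases"]
```

A real eXist deployment would additionally maintain structural and full-text indices so that such path queries do not scan every document.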

Google Map-Reduce
Google Map-Reduce (Yang et al., 2007) is the programming model used by Google for processing Big Data. Users specify the computation in terms of a map and a reduce function. The underlying system parallelizes the computation across large-scale clusters of machines and is also responsible for handling failures, maintaining effective communication, and managing performance. The Map function in the master node takes the inputs, partitions them into smaller subproblems, and distributes them to operational nodes. Each operational node can repeat this step, creating a multilevel tree structure. The operational nodes process the smaller problems and return the responses to their parent nodes. In the Reduce function, the root node takes the answers to the subproblems and combines them to produce the answer to the global problem being solved. The advantage of Map-Reduce is that it is intrinsically parallel, which allows the mapping and reduction operations to be distributed. The Map operations are independent of each other and can be performed in parallel (with limitations given by the data source and/or the number of CPUs/cores close to that data); in the same way, a series of Reduce operations can perform the reduction step. As a result, queries or other highly distributable algorithms can potentially run in real time, which is a very important feature in some work environments.
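The map/shuffle/reduce flow described above can be sketched in a few lines of Python. The word-count example below is purely illustrative and runs in a single process, whereas the real system distributes the same three phases across a cluster:

```python
from collections import defaultdict

def map_phase(documents, map_fn):
    # Each document is mapped independently, so this loop could run in parallel.
    pairs = []
    for doc in documents:
        pairs.extend(map_fn(doc))
    return pairs

def shuffle(pairs):
    # Group all intermediate values by key before the reduction step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Each key is reduced independently, so this step also parallelizes.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count: the canonical Map-Reduce example.
def word_map(doc):
    return [(word, 1) for word in doc.split()]

def word_reduce(word, counts):
    return sum(counts)

docs = ["big data", "big clusters process big data"]
result = reduce_phase(shuffle(map_phase(docs, word_map)), word_reduce)
# result == {"big": 3, "data": 2, "clusters": 1, "process": 1}
```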

Hadoop
[Hadoop Apache Project] is a framework that allows managing distributed processing of Big Data across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The Hadoop library is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. Hadoop was inspired by Google's Map-Reduce and the Google File System (GFS), and in practice it has been adopted in a wide range of cases. Hadoop is designed to scan large data sets to produce results through a distributed and highly scalable batch processing system. It is composed of the Hadoop Distributed File System (HDFS) and of the programming paradigm Map-Reduce (Karloff et al., 2010); thus, it is capable of exploiting the redundancy built into the environment. The programming model is capable of detecting failures and solving them automatically by running specific programs on various servers in the cluster. In fact, redundancy provides fault tolerance and a self-healing capability to the Hadoop cluster. HDFS allows applications to be run across multiple servers, which usually have a set of inexpensive internal disk drives; the possibility of using commodity hardware is another advantage of Hadoop. A similar and interesting solution is HadoopDB, proposed by a group of researchers at Yale. HadoopDB was conceived with the idea of creating a hybrid system that combines the main features of two technological solutions: parallel databases, for performance and efficiency, and Map-Reduce-based systems, for scalability, fault tolerance, and flexibility. The basic idea behind HadoopDB is to use Map-Reduce as the communication layer above multiple nodes running single-node DBMS instances. Queries are expressed in SQL and then translated into Map-Reduce. In particular, the implemented solution involves the use of PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer (Abouzeid et al., 2009).

HBase
HBase (Aiyer et al., 2012) is a large-scale distributed database built on top of the HDFS, mentioned above. It is a nonrelational database developed as an open-source project. Many traditional RDBMSs use a single mutating B-tree for each index stored on disk. HBase, on the other hand, uses a Log Structured Merge Tree approach: it first collects all updates in a special in-memory data structure and then, periodically, flushes this memory to disk, creating a new index-organized data file, also called an HFile. These indices are immutable over time, while the several files created on disk are periodically merged. With this approach, writes to disk are performed sequentially. HBase's performance is satisfactory in most cases and may be further improved by using Bloom filters (Borthakur et al., 2011). Both HBase and HDFS have been developed with elasticity as a fundamental principle, and the use of low-cost disks has been one of the main goals of HBase. Scaling the system is therefore easy and cheap, even though it has to maintain a certain fault tolerance capability in the individual nodes.
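The Log Structured Merge approach can be illustrated with a toy in-memory sketch (purely illustrative; real HBase memstores, HFiles, and compactions are far more elaborate):

```python
import bisect

class LSMStore:
    """Toy Log-Structured Merge store: writes go to an in-memory buffer
    (the memtable), which is periodically flushed to an immutable, sorted
    "file" (an HFile, in HBase terms); reads check the buffer, then the files."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.hfiles = []  # list of sorted (key, value) lists, newest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A sequential write of a new immutable sorted file.
        self.hfiles.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for hfile in self.hfiles:  # newest file wins, as in a merged read
            keys = [k for k, _ in hfile]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return hfile[i][1]
        return None

store = LSMStore()
store.put("row1", "a")
store.put("row2", "b")   # reaching the threshold triggers a flush
store.put("row1", "c")   # the newer value stays in the memtable
# store.get("row1") == "c", store.get("row2") == "b"
```

The key point is that random writes are absorbed in memory and reach disk only as sequential writes of sorted immutable files, which is what makes HBase write-friendly.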

Hive
[Apache Hive] is an open-source data warehousing solution built on top of Hadoop. Hive has been designed with the aim of analyzing large amounts of data more productively, improving the query capabilities of Hadoop. Hive supports queries expressed in an SQL-like declarative language—HiveQL—to extract data from sources such as HDFS or HBase. The architecture is divided into the Map-Reduce paradigm for computation (with the ability for users to enrich queries with custom Map-Reduce scripts), metadata information for data storage, and a processing part that receives queries from users or applications for execution. The core in/out libraries can be expanded to analyze customized data formats. Hive is also characterized by the presence of a system catalog (the Metastore) containing schemas and statistics, which is useful in operations such as data exploration, query optimization, and query compilation. At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700 TB of data, and is used extensively for both reporting and ad-hoc analyses by more than 200 users per month (Thusoo et al., 2010).

MonetDB
MonetDB (Zhang et al., 2012) is an open-source DBMS for data mining applications. It has been designed for applications with large databases and queries, in the fields of Business Intelligence and Decision Support. MonetDB has been built around the concept of bulk processing: simple operations applied to large volumes of data using efficient hardware, for large-scale data processing. At present, two versions of MonetDB are available, working with different types of databases: MonetDB/SQL with a relational database, and MonetDB/XML with an XML database. In addition, a third version is under development to introduce RDF and SPARQL (SPARQL Protocol and RDF Query Language) support. MonetDB provides a full SQL interface but does not target high-volume transaction processing with its multilevel ACID properties. MonetDB achieves performance improvements, in terms of speed, for both relational and XML databases thanks to innovations introduced at the DBMS level: a storage model based on vertical fragmentation, run-time query optimization, and a modular software architecture. MonetDB is designed to take advantage of large amounts of main memory and implements new techniques for efficient support of its workloads. MonetDB represents relational tables using vertical fragmentation (column stores), storing each column in a separate table, called a BAT (Binary Association Table). The left column, usually the OID (object-id), is called the head, and the right column, which usually contains the actual attribute values, is called the tail.
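A minimal sketch of this vertical fragmentation, with one BAT per column (plain Python lists standing in for MonetDB's internal structures):

```python
# Vertical fragmentation: each column is stored as its own Binary Association
# Table (BAT) mapping an object-id "head" to an attribute-value "tail".
class BAT:
    def __init__(self):
        self.head = []   # OIDs
        self.tail = []   # attribute values

    def append(self, oid, value):
        self.head.append(oid)
        self.tail.append(value)

# A relational table sales(region, amount) becomes one BAT per column.
region, amount = BAT(), BAT()
for oid, (r, a) in enumerate([("north", 10), ("south", 25), ("north", 5)]):
    region.append(oid, r)
    amount.append(oid, a)

# A query touches only the BATs it needs (the column-store advantage):
north_oids = {oid for oid, r in zip(region.head, region.tail) if r == "north"}
north_total = sum(a for oid, a in zip(amount.head, amount.tail)
                  if oid in north_oids)
# north_total == 15
```

A row store would have read entire rows to answer this query; the BAT layout scans only the two columns involved, which is why the model suits analytical workloads.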

MongoDB
[MongoDB] is a document-oriented database that stores document data in BSON, a binary JSON format. Its basic idea is the use of a more flexible model, the "document," to replace the classic concept of a "row." In fact, with the document-oriented approach, it is possible to represent complex hierarchical relationships within a single record, thanks to embedded documents and arrays. MongoDB is open-source and schema-free—that is, there are no fixed or predefined document keys—and it allows defining indices based on specific fields of the documents. In order to retrieve data, ad-hoc queries based on these indices can be used. Queries are created as BSON objects to make them more efficient and are similar to SQL queries. MongoDB supports MapReduce queries and atomic operations on individual fields within a document. It allows realizing redundant and fault-tolerant systems that can easily be scaled horizontally, thanks to sharding based on the document keys and to the support of asynchronous master–slave replication. Relevant advantages of MongoDB are the opportunity to create data structures that easily store polymorphic data, and the possibility of building elastic cloud systems given its scale-out design, which increases ease of use and developer flexibility. Moreover, server costs are significantly low because a MongoDB deployment can use commodity, inexpensive hardware, and its horizontal scale-out architecture can also reduce storage costs.
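The document model and query-by-example matching can be sketched with plain Python dictionaries standing in for BSON documents (this illustrates the data model only, not the pymongo driver API):

```python
# A blog post as a single document: an embedded comment list and a tag array
# replace the joins a relational schema would need, and no schema is imposed.
posts = [
    {"_id": 1, "title": "Big Data", "tags": ["nosql"],
     "comments": [{"author": "ada", "text": "nice"}]},
    {"_id": 2, "title": "Column stores", "tags": ["sql", "olap"],
     "comments": []},
]

def find(collection, query):
    # Minimal query-by-example matcher over document fields.
    def matches(doc):
        for field, wanted in query.items():
            value = doc.get(field)
            # As in MongoDB, a scalar query value matches inside an array field.
            if isinstance(value, list):
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True
    return [doc for doc in collection if matches(doc)]

hits = find(posts, {"tags": "nosql"})
# hits[0]["title"] == "Big Data"
```

In MongoDB itself, an index declared on "tags" would let this query avoid scanning the whole collection.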

Objectivity
[Objectivity Platform] is a distributed OODBMS (Object-Oriented Database Management System) for applications that require complex data models. It supports a large number of simultaneous queries and transactions and provides high-performance access to large volumes of physically distributed data. Objectivity manages data in a transparent way and uses a distributed database architecture that allows good performance and scalability. The main reasons for using a database of this type include the presence of complex relationships that suggest tree structures or graphs, and the presence of complex data, that is, when there are components of variable length and, in particular, multidimensional arrays. Other reasons are related to a database that must be geographically distributed and accessed via a processor grid, the use of more than one language or platform, and the use of workplace objects. Objectivity has an architecture consisting of a single distributed database, a choice that allows achieving high performance in relation to the amount of data stored and the number of users. This architecture distributes computation and data storage tasks transparently across the different machines; it is also scalable and highly available.

OpenQM
[OpenQM Database] is a DBMS for developing and running applications, and it includes a wide range of tools and advanced features for complex applications. Its database model belongs to the Multivalue family, so it has many aspects in common with Pick-descended databases, and it is transactional. The development of Multivalue applications is often faster than with other types of database, which implies lower development costs and easier maintenance. This instrument has a high degree of compatibility with other Multivalue database systems such as UniVerse [UniVerse], PI/open, D3, and others.

RDF-HDT
RDF-HDT (Header-Dictionary-Triples) [RDF-HDT Library] is a representation format that modularizes the data and exploits the structure of large RDF graphs to save storage space. It is based on three main components: a Header, a Dictionary, and a set of Triples. The Header includes logical and physical data that describe the RDF data set, and it is the entry point to the data set. The Dictionary organizes all the identifiers in an RDF graph and provides a catalog of the amount of information in the RDF graph with a high level of compression. The set of Triples, finally, comprises the pure structure of the underlying RDF graph and avoids the noise produced by long labels and repetitions. This design gains in modularity and compactness and addresses other important characteristics: it allows on-demand access to the RDF graph and is used to design RDF-specific compression techniques (HDT-compress) able to outperform universal compressors. RDF-HDT introduces several advantages, such as compactness and compression of stored data, using small amounts of memory space, communication bandwidth, and time. RDF-HDT uses little storage space, thanks to the asymmetric structure of large RDF graphs, and its representation format consists of two primary modules, Dictionary and Triples. The Dictionary contains a mapping between elements and unique IDs, without repetition, thanks to which it achieves a high compression rate and speed in searches. The Triples component corresponds to the initial RDF graph in a compacted form in which elements are replaced with the corresponding IDs. Thanks to these two processes, HDT can also be generated from RDF (HDT encoder) and can manage separate accesses to run queries, to access the full RDF, or to carry out management operations (HDT decoder).

RDF-3X
RDF-3X (Schramm, 2012) is an RDF store implementing SPARQL [SPARQL] that achieves excellent performance through a RISC-style (Reduced Instruction Set Computer) architecture with efficient indexing and query processing. The design of RDF-3X completely eliminates the process of index tuning, thanks to an exhaustive index of all permutations of subject–predicate–object triples and their unary and binary projections, resulting in highly compressed indices and in a query processor that can deliver results with excellent performance. The query optimizer can choose optimal join orders even for complex queries, using a cost model that includes statistical synopses for entire join paths. RDF-3X is able to provide good support for efficient online updates, thanks to a staging architecture.
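The exhaustive permutation indexing can be sketched as follows: one sorted index per ordering of (subject, predicate, object), so that any triple pattern with bound positions maps to a prefix scan of some permutation (a simplified illustration; RDF-3X additionally compresses the indices and stores aggregated projections):

```python
from itertools import permutations

triples = {
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
}

# One sorted index per permutation of (S, P, O) positions: all six orderings
# (SPO, SOP, PSO, POS, OSP, OPS), built exhaustively as in RDF-3X.
ORDERS = list(permutations((0, 1, 2)))
indexes = {order: sorted(tuple(t[i] for i in order) for t in triples)
           for order in ORDERS}

def match(pattern):
    # Pick the permutation that puts the bound positions first; the answers
    # then form a contiguous prefix range (here found by a linear filter,
    # where a real store would use a range scan on the sorted index).
    bound = tuple(i for i, v in enumerate(pattern) if v is not None)
    free = tuple(i for i in range(3) if pattern[i] is None)
    order = bound + free
    prefix = tuple(pattern[i] for i in bound)
    return [row for row in indexes[order] if row[:len(prefix)] == prefix]

# Everyone alice knows (pattern bound on S and P, scanned in SPO order):
rows = match(("alice", "knows", None))
# rows == [("alice", "knows", "bob"), ("alice", "knows", "carol")]
```

Because every ordering is materialized, no pattern is ever penalized by an unsuitable index, which is exactly why no tuning is needed.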

Comparison and Analysis of Architectural Features


A large number of products and solutions have been reviewed in order to identify the products of most interest to readers and to the market, and a selection of solutions and products has been proposed in this chapter with the aim of representing all of them. The analysis performed has been very complex, since a multidisciplinary team has been involved in assessing the several aspects of multiple solutions, including correlated issues in the case of tools depending on other solutions (as reported in Table 2.1). This is due to the fact that the features of Big Data solutions are strongly intercorrelated, and thus it was impossible to identify orthogonal aspects that would provide a simple and easy-to-read taxonomical representation. Table 2.1 can be used to compare different solutions in terms of: the infrastructure and architecture of the systems that should cope with Big Data; data management aspects; and data analytics aspects, such as ingestion, log analysis, and everything else pertinent to the post-production phase of data processing. Some of the information related to specific features of products and tools could not be clearly identified. In those cases, we preferred to report that the information was not available.

Application Domains Comparison


As reported in the introduction, Big Data solutions and technologies are currently used in many application domains, with remarkable results and excellent future prospects for fully dealing with the main challenges of data modeling, organization, retrieval, and data analytics. A major investment in Big Data solutions can lay the foundations for the next generations of advances in medicine, science and research, education and e-learning, business and finance, healthcare, smart cities, security, info-mobility, social media, and networking.
In order to assess different fields and solutions, a number of factors
have been identified. In Table 2.2, the relevance of the main features of Big Data solutions with respect to the most interesting applicative domains is reported. When possible, each feature has been expressed for each applicative domain in terms of relevance: high (H), medium (M), or low (L) relevance/impact. For some features, the assessment of a grading was not possible, and thus a comment including commonly adopted specific solutions and/or technologies has been provided.

Table 2.2
Relevance of the Main Features of Big Data Solutions with Respect to the Most Interesting Applicative Domains

Domains: SR = data analysis in scientific research (biomedical); ED = education and cultural heritage; ET = energy/transportation; FB = financial/business; HC = healthcare; SE = security; SC = smart cities and mobility; SM = social media marketing; SN = social network, Internet service, and web data.

Infrastructural and architectural aspects       SR  ED  ET  FB  HC  SE  SC  SM  SN
  Distributed management                        H   M   H   H   H   H   M   H   H
  High availability                             H   M   H   H   H   H   H   M   H
  Internal parallelism (related to velocity)    H   M   H   M   H   H   H   M   M

Data management aspects
  Data dimension (data volume)                  H   M   M   H   M   H+  H+  H+  H+
  Data replication                              H   L   M   H   H   H   H   H   M
  Data relationships                            H   M   L   H   H   M   H   H   H
  SQL interoperability                          M   M   L   H   H   L   H   H   M
  Data variety                                  H   M   M   M   M   H   H   H   H
  Data variability                              H   H   M   H   H   H   H   H   H

Data access aspects
  Data access performance                       H   H   L   M   H   L   H   L   H
  Visual rendering and visualization            H   M   M   H   H   L   H   M   M
  Faceted query results                         L   H   L   H   H   L   L   H   H
  Graph relationships navigation                L   M   M   H   M   L   H   H   H

Data analytics aspects
  Indexing speed                                L   L   L   H   M   H   H   M   H
  Semantic query                                H   M   M   M   H   L   L   M   H
  Statistical analysis tools in queries         H   H   H   H   H   L   M   H   H
  CEP (active query)                            H   L   M   H   H   M   H   L   H
  Log analysis                                  L   M   L   H   H   H   H   H   H
  Streaming processing                          M   L   M   H   M   H   H   M   H (network monitoring)

[The rows Data organization, Data access type, and Type of indexing report domain-typical technologies rather than gradings (chunks, blobs, and buckets; SQL, API, command-line, and web access interfaces; HDFS, RDBMS-like, B-tree, hash, multidimensional, inverted, and RDF-graph indexes) and are not reproduced column by column here.]
The main features are strongly similar to those adopted in Table 2.1 (the features that did not present relevant differences across the domains have been removed to make the table more readable). Table 2.2 can be considered a reading key to compare Big Data solutions on the basis of their relevant differences. The assessment has been performed according to a state-of-the-art analysis of several Big Data applications in the domains and of the corresponding solutions proposed. The application domains should be considered as macro-areas rather than specific scenarios. Despite the fact that the state of the art of this matter is in continuous evolution, the authors think that the work presented in this chapter can be used as a first step toward understanding the key factors for identifying suitable solutions to cope with a specific new domain. This means that the considerations have to be taken as examples and generalizations of the analyzed cases.
Table 2.2 can be read line by line. For example, considering the infrastructural aspect of supporting distributed architectures ("Distributed management"), at first glance one could state that this feature is relevant for all the applicative domains. On the other hand, a lower degree of relevance has been noted for the educational domain. For example, student profile analysis for the purpose of personalized courses is typically locally based and accumulates less information than global market analysis, security, social media, etc. The latter cases are typically deployed as multisite, geographically distributed databases at the worldwide level, while educational applications are usually confined to regional and local levels. For instance, in higher education, a moderate amount of information is available, and its use is often confined to the local level, for the students of the institute. This makes the advantage of geographically distributed services less important and interesting, whereas social networking applications typically need highly distributed architectures, since their users are also geographically distributed. Similar considerations can be applied to the demand for database consistency, and perhaps to high availability, which may be less relevant for education and social media than for the safety-critical situations of energy management and transportation. Moreover, the internal parallelism of the solution is an interesting feature that can be fully exploited only in specific cases, depending on the data analytics algorithms adopted and on whether the problems can take advantage of a parallel implementation of the algorithm. This feature is strongly related to the reaction time: for example, in most social media solutions, the estimation of suggestions, and thus the clustering of user profiles and content, is performed offline, updating values periodically but not in real time.
As regards the data management aspects, the amount of data involved is considerably huge in almost all the application domains (on the order of several petabytes or exabytes). On the other hand, this is not always true for the size of individual files (with the exception of satellite images, medical images, or other multimedia files that can be several gigabytes in size). The two aspects (the number of elements to be indexed and accessed, and the typical data size) are quite different and, in general, the former (the number of elements) creates the major problems for processing and access. For this reason, security, social media, and smart cities have been considered the applicative domains with the highest demand for Big Data solutions in terms of volume. Moreover, in many cases, the main problem is not the size of the data, but rather the management and preservation of the relationships among the various elements, which represent the effective semantic value of the data set (the data model row of Table 2.1 may help in comparing the solutions): for example, user profiles (human relationships), traffic localization (service and time relationships), patients' medical records (event and data relationships), etc. Data relationships are often stored in dedicated structures, and making specific queries and reasoning on them can be very important for some applications, such as social media and networking. Therefore, for this aspect, the most challenging domains are again smart cities, social networks, and health care. In this regard, in the last two of these application domains, the use of graph relationship navigation constitutes a particularly useful support for improving the representation, search, and understanding of information and meanings not explicitly evident in the data itself.
In almost all domains, the usage and interoperability of both SQL and NoSQL databases are very relevant, and some differences can be detected in the data organization. In particular, interoperability with former SQL systems is a very important feature in application contexts such as healthcare, social media marketing, business, and smart cities, due to the widespread use of traditional RDBMSs, rather than in application domains such as scientific research, social networks and web data, security, and energy, which are mainly characterized by unstructured or semistructured data. Some of the applicative domains intrinsically present a large variety and variability of data, while others present more standardized and regular information. A few of these domains present both variety and variability of data, such as scientific research, security, and social media, which may involve content-based analysis, video processing, etc.
Furthermore, the problem of data access is of particular importance, both in terms of performance and for the features provided for the rendering, representation, and/or navigation of produced results, such as visual rendering tools, presentation of results by using faceted facilities, etc. Most of the domains are characterized by the need for different custom interfaces for data rendering (built ad hoc on the main features) that provide safe and consistent data access. For example, in healthcare and scientific research, it is important to take account of issues related to concurrent access, and thus data consistency, while in social media and smart cities it is important to provide on-demand and multidevice access to information, graphs, real-time conditions, etc. Flexible visual rendering (distributions, pies, histograms, trends, etc.) may be a strongly desirable feature for many scientific and research applications, as well as for financial data and health care (e.g., for reconstruction, trend analysis, etc.). Faceted query results can be very interesting for navigating mainly text-based Big Data, as in the educational and cultural heritage application domains. Graph navigation among the resulting relationships can be an unavoidable solution for representing the resulting data in smart cities and social media, and for presenting related implications and facts in financial and business applications. Moreover, in certain specific contexts, the data rendering has to be compliant with standards, for example, in health care.
In terms of data analytics aspects, several different features can be of interest in the different domains. The most relevant feature in this area is the type of indexing, which in turn characterizes the indexing performance. Indexing performance is very relevant in the domains in which a huge amount of small data has to be collected and needs to be accessed and elaborated in a short time, such as finance, health care, security, and mobility. Otherwise, if the aim of the Big Data solution is mainly access and data processing, then fast indexing can be less relevant. For example, the use of HDFS may be suitable in contexts requiring complex and deep data processing, such as the evaluation of the evolution of a particular disease in the medical field, or the definition of a specific business model. This approach, in fact, runs the processing function on a reduced data set, thus achieving the scalability and availability required for processing Big Data. In education, instead, the usage of ontologies, and thus of RDF databases and graphs, provides a rich semantic structure better than any other method of knowledge representation, improving the precision of search and access for educational contents, including the possibility of enforcing inference on the semantic data structure.
The possibility of supporting statistical and logical analyses on data via specific queries and reasoning can be very important for some applications, such as social media and networking. If this feature is structurally supported, it is possible to perform direct operations on the data, or to define and store specific queries for direct and fast statistical analysis: for example, for estimating recommendations, firing conditions, etc.
In other contexts, however, the continuous processing of data streams is very important, for example, to respond quickly to requests for information and services from the citizens of a "smart city," to monitor the performance of financial stocks in real time, or to report to medical staff unexpected changes in the health status of patients under observation. As can be seen from the table, in these contexts a particularly significant feature is the CEP (complex event processing) approach, based on active queries, which improves the user experience with a considerable increase in the speed of analysis. On the contrary, in the fields of education, scientific research, and transport, speed of analysis is not a feature of primary importance, since in these contexts the most important thing is the storage of data to keep track of the results of experiments, phenomena, or situations that occurred in a specific time interval; the analysis is a step that can be performed at a later time.

Table 2.3
Relevance of the Main Features of Big Data Solutions with Respect to the Most Interesting Applicative Domains

[Table 2.3 cross-maps the reviewed products (ArrayDBMS, CouchBase, Db4o, eXist, Google MapReduce, Hadoop, HBase, Hive, MonetDB, MongoDB, Objectivity, OpenQM, RdfHdt Library, and RDF-3X) against the nine application domains of Table 2.2, marking with an X the products adopted in each domain; the rotated column layout of the printed table is not reproduced here.]
Note: Y = frequently adopted.
Lastly, Table 2.3 shows the considered application domains in relation to the products examined in the previous section and to the reviewed scenarios. Among all the products reviewed, MonetDB and MongoDB are among the most flexible and adaptable to different situations and application contexts. It is also interesting to note that RDF-based solutions have been used mainly in social media and networking applications.
Shown below are the most effective features of each product analyzed and the main application domains in which their use is commonly suggested.
HBase over HDFS provides an elastic and fault-tolerant storage solution with strong consistency. Both HBase and HDFS are grounded on the fundamental design principle of elasticity. Facebook Messages (Aiyer et al., 2012) exploits the potential of HBase to combine services such as messages, email, and chat into a real-time conversation, which entails managing approximately 14 TB of messages and 11 TB of chat data each month. For these reasons, it is successfully used in the field of social media Internet services as well as in social media marketing.
RDF-3X is considered one of the fastest RDF representations and is particularly advantageous in handling small data sets. Its physical design completely eliminates the need for index tuning, thanks to highly compressed indices for all permutations of triples and their binary and unary projections. Moreover, RDF-3X is optimized for queries and provides suitable support for efficient online updates by means of a staging architecture.
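The "all permutations" indexing idea can be sketched over a toy triple set (illustrative only; RDF-3X stores these indexes in highly compressed form, while the sketch uses plain sorted lists):

```python
from itertools import permutations

triples = [
    ("alice", "knows", "bob"),
    ("alice", "likes", "jazz"),
    ("bob", "knows", "carol"),
]

# One sorted index per permutation of (S, P, O):
# SPO, SOP, PSO, POS, OSP, OPS.
ORDERS = {"".join(o): o for o in permutations("SPO")}
POS_OF = {"S": 0, "P": 1, "O": 2}

indexes = {
    name: sorted(tuple(t[POS_OF[c]] for c in order) for t in triples)
    for name, order in ORDERS.items()
}

def lookup_pos(predicate):
    """Scan the POS index for all (object, subject) pairs of a predicate."""
    return [(o, s) for p, o, s in indexes["POS"] if p == predicate]

print(lookup_pos("knows"))  # [('bob', 'alice'), ('carol', 'bob')]
```

Because every ordering exists, any triple pattern can be answered by a range scan on a matching index, which is why no manual index tuning is needed.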
MonetDB achieves a significant speed improvement for both relational/SQL and XML/XQuery databases over other open-source systems; it introduces innovations at all layers of a DBMS: a storage model based on vertical fragmentation (column store), a modern CPU-tuned query execution architecture, automatic and self-tuning indexes, run-time query optimization, and a modular software architecture. MonetDB is primarily used for the management of large amounts of images, for example, in astronomy, seismology, and earth observation. These relevant features place MonetDB as one of the best tools in the field of scientific research and scientific data analysis, thus defining an interesting technology on which to develop scientific applications and create interdisciplinary platforms for the exchange of data in the worldwide community of researchers.
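The column-store idea (vertical fragmentation) can be conveyed with a toy Python comparison: an aggregate over one attribute touches a single dense array rather than every field of every row (illustrative only; MonetDB's actual storage and execution engine are far more sophisticated):

```python
# Row store: each record kept together; a scan over one attribute
# still touches every field of every row.
rows = [
    {"id": 1, "city": "Oslo", "temp": 4.0},
    {"id": 2, "city": "Rome", "temp": 18.5},
    {"id": 3, "city": "Oslo", "temp": 3.2},
]

# Column store (vertical fragmentation): one dense array per attribute,
# so an aggregate over "temp" reads only that column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(round(avg_temp, 2))  # 8.57
```

On analytical workloads that scan a few attributes of very wide tables (typical of scientific data), this layout is the source of much of the speedup.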
HDT has been proved by experiments to be a good tool for compacting data sets. It allows a data set to be compacted more than 15 times with respect to standard RDF representations, thus improving parsing and processing while maintaining a consistent publication scheme. RDF-HDT therefore improves compactness and compression, using much less space and thus saving storage space and communication bandwidth. For these reasons, this solution is especially suited to sharing data on the web, but also to contexts that require operations such as data analysis and visualization of results, thanks to the support for 3D visualization of the adjacency matrix of the RDF graph.
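Much of HDT's compactness comes from storing each distinct term once in a dictionary and replacing triples with small integer IDs. A toy Python sketch of this dictionary encoding (the real HDT format adds compressed bitmap structures on top of the ID triples):

```python
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/knows", "http://example.org/carol"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/carol"),
]

# Dictionary: every distinct term stored once, mapped to an integer ID.
terms = sorted({t for triple in triples for t in triple})
term_id = {t: i for i, t in enumerate(terms)}

# Triples component: small integer tuples instead of repeated full URIs.
encoded = [tuple(term_id[t] for t in triple) for triple in triples]

plain_chars = sum(len(t) for triple in triples for t in triple)
dict_chars = sum(len(t) for t in terms)
print(encoded)
print(plain_chars, dict_chars)  # the dictionary is far smaller than the raw listing
```

Even on three triples the repeated URIs collapse into four dictionary entries; at web scale the savings in storage and bandwidth are what make the format attractive for publication.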
eXist’s query engine implements efficient index structures to collect data for scientific and academic research, educational assessments, and consumption analysis in the energy sector, and its index-based query processing is needed to efficiently perform queries on large document collections. Experiments have, moreover, demonstrated the linear scalability of eXist’s indexing, storage, and querying architecture. In general, search expressions using the full-text index perform better with eXist than the corresponding queries based on XPath.
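Path-based queries over XML of the kind eXist evaluates can be previewed with Python's standard library (a stand-in for illustration only; eXist adds persistent indexes, full XQuery support, and full-text search on top of such path navigation):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <paper year="2003"><title>eXist: An open source native XML database</title></paper>
  <paper year="1998"><title>The multidimensional database system RasDaMan</title></paper>
</library>
""")

# A simple path query: titles of all papers published after 2000.
titles = [p.find("title").text
          for p in doc.findall("paper")
          if int(p.get("year")) > 2000]
print(titles)
```

In eXist the same selection would be an XPath/XQuery expression evaluated against its persistent indexes rather than an in-memory tree.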
In scientific/technical applications, ArrayDBMS is often used in combination with complex queries, and therefore query optimization is fundamental. ArrayDBMS can exploit both hardware and software parallelism, which makes possible the realization of efficient systems in many fields, such as geology, oceanography, atmospheric sciences, gene data, etc.
Objectivity/DB guarantees complete support for ACID transactions and can be replicated to multiple locations. Objectivity/DB is highly reliable and, thanks to the possibility of schema evolution, it provides advantages over other technologies that have a difficult time changing or updating a field. Thus, it has typically been used for building data-intensive systems or real-time applications that manipulate large volumes of complex data. Precisely because of these features, its main application fields are healthcare and financial services, respectively, for the real-time management of electronic health records and for the analysis of products with higher consumption, along with the monitoring of sensitive information to support intelligence services.
OpenQM enables system development with reliability and also provides efficiency and stability. The choice of OpenQM is usually related to the need for speed, security, and reliability, and also to the ability to easily build excellent GUI interfaces on top of the database.
Couchbase is a high-performance and scalable data solution supporting high availability, fault tolerance, and data security. Couchbase can provide extremely fast response times. It is particularly suggested for applications developed to support citizens in the new model of smart urban cities (smart mobility, energy consumption, etc.). Thanks to its low latency, Couchbase is mainly used in the development of online gaming and in applications where obtaining a significant performance improvement is very important, or where the extraction of meaningful information from the large amount of data constantly exchanged is mandatory, for example, in social networks such as Twitter, Facebook, Flickr, etc.
The main advantage of Hadoop is its ability to analyze huge data sets to quickly spot trends. In fact, most customers use Hadoop together with other types of software such as HDFS. The adoption of Google MapReduce provides several benefits: the indexing code is simpler, smaller, and easier to understand, and it guarantees fault tolerance and parallelization. Both Hadoop and Google MapReduce are preferably used in applications requiring large distributed computation. The New York Times, for example, uses Hadoop to process raw images and turn them into PDF format in an acceptable time (about 24 h for each 4 TB of images). Other big companies exploit the potential of these products: eBay, Amazon, Twitter, and Google itself, which uses MapReduce to regenerate Google's index, to update indices, and to run various types of analyses. Furthermore, this technology can be used in medical fields to perform large-scale data analysis with the aim of improving treatments and prevention of disease.
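The two-phase programming model shared by Hadoop and Google MapReduce can be sketched as a word count expressed as map and reduce functions over in-memory data. This single-process Python sketch illustrates the model only, not Hadoop's distributed, fault-tolerant runtime:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big data computing", "data"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'big': 2, 'data': 3, 'computing': 1}
```

Because map and reduce are side-effect-free per key, the framework can run thousands of such tasks in parallel and simply re-execute any that fail, which is where the fault tolerance mentioned above comes from.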
Hive significantly reduces the effort required for a migration to Hadoop, which makes it perfect for data warehousing; it also offers the ability to create ad-hoc queries using a jargon similar to SQL. These features make Hive excellent for the analysis of large data sets, especially in social media marketing and web business applications.
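The SQL-like flavor of an ad-hoc HiveQL aggregate can be previewed with SQLite as a local stand-in (Hive itself compiles such queries into MapReduce jobs over data in HDFS rather than executing them on a local engine; the table and data here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("u1", "home"), ("u2", "home"), ("u1", "shop")])

# An ad-hoc aggregate, much as it would be written in HiveQL.
top = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()
print(top)  # [('home', 2), ('shop', 1)]
```

This familiarity is precisely what lowers the migration effort: analysts keep writing declarative aggregates while Hive handles the translation to distributed jobs.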
MongoDB provides relevant flexibility and simplicity, which may reduce development and data modeling time. It is typically used in applications requiring insertion and updating in real time, in addition to real-time query processing. It allows one to define the consistency level, which is directly related to the achievable performance. If high performance is not a necessity, it is possible to obtain maximum consistency by waiting until the new element has been replicated to all nodes. MongoDB uses internal memory to store the working set, thus allowing faster access to data. Thanks to these characteristics, MongoDB is easily usable in business and social marketing fields, and it is successfully used in gaming environments, thanks to its high performance for small read/write operations. As with many other Big Data solutions, it is well suited for applications that handle high volumes of data where a traditional DBMS might be too expensive.
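The tunable consistency/performance trade-off described above can be sketched as a write that waits for a configurable number of replica acknowledgments. This is a toy model in plain Python; in MongoDB the number of nodes to wait for is expressed through the write concern option `w`, and replication is of course asynchronous over a network:

```python
class ReplicaSet:
    """Toy replicated store: a write is acknowledged after w replicas have it."""
    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value, w=1):
        acked = 0
        for replica in self.replicas:
            replica[key] = value          # replication happens node by node
            acked += 1
            if acked >= w:
                return acked              # early ack: fast, weaker guarantee
        return acked

rs = ReplicaSet(n_replicas=3)
fast_ack = rs.write("user:1", {"name": "Ada"}, w=1)  # low latency
safe_ack = rs.write("user:2", {"name": "Lin"}, w=3)  # maximum consistency
print(fast_ack, safe_ack)  # 1 3
```

Waiting for all replicas (w equal to the replica count) gives the "maximum consistency" case in the text, at the cost of the slowest node's latency.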
Db4o does not need a mapping function between the in-memory representation and what is actually stored on disk, because the application schema corresponds to the data schema. This advantage allows one to obtain better performance and a good user experience. Db4o also permits database access using a simple programming language (Java, .NET, etc.), and thanks to its type safety, queries do not need to be checked against code injection. Db4o supports the CEP paradigm (see Table 2.2) and is therefore very suitable for medical applications, scientific research, and the analysis of financial and real-time data streams, where the demand for this feature is very high.
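The absence of a mapping layer (the application schema is the data schema) can be tried with Python's standard `shelve` module, which likewise persists native objects directly. This is a loose analogy only; Db4o persists full object graphs from Java/.NET, and the `Patient` class and file path here are invented for illustration:

```python
import os
import shelve
import tempfile

class Patient:
    def __init__(self, name, heart_rate):
        self.name = name
        self.heart_rate = heart_rate

path = os.path.join(tempfile.mkdtemp(), "clinic.db")

# Store the object as-is: no tables, no object-relational mapping code.
with shelve.open(path) as db:
    db["p1"] = Patient("Ada", 72)

# Read it back as the same application-level object.
with shelve.open(path) as db:
    loaded = db["p1"]

print(loaded.name, loaded.heart_rate)  # Ada 72
```

Skipping the mapping step is what yields the performance and development-time advantage claimed above: the code that defines the class is all the schema there is.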

Conclusions
We have entered an era of Big Data. There is the potential for making faster advances in many scientific disciplines through better analysis of these large volumes of data, and also for improving the profitability of many enterprises. The need for these new-generation data management tools is driven by the explosion of Big Data and by the rapidly growing volumes and variety of data collected today from alternative sources such as social networks like Twitter and Facebook.
NoSQL Database Management Systems represent a possible solution to these problems; unfortunately, they are not a definitive one: these tools have a wide range of features that can be further developed to create new products more adaptable to this huge, constantly growing stream of data and to its open challenges such as error handling, privacy, unexpected correlation detection, trend analysis and prediction, timeliness analysis, and visualization. Considering this latter challenge, it is clear that, in a fast-growing market for maps, charts, and other ways to visually sort data, these larger volumes of data and analytical capabilities become the new coveted features; today, in fact, in the "Big Data world," static bar charts and pie charts no longer make sense, and more and more companies are demanding
more dynamic, interactive tools and methods for line-of-business managers and information workers for viewing, understanding, and operating on the analysis of Big Data.
Each product compared in this review presents different features that may be needed in the different situations with which we are dealing. In fact, there is still no definitive, ultimate solution for the management of Big Data. The best way to determine which product to base the development of your system on may consist in carefully analyzing the available data sets and determining the requirements you cannot give up. Then, an analysis of the existing products is needed to determine the pros and cons, also considering other nonfunctional features such as the programming language, integration aspects, legacy constraints, etc.

References
Abouzeid A., Bajda-Pawlikowski C., Abadi D., Silberschatz A., Rasin A., HadoopDB:
An architectural hybrid of MapReduce and DBMS technologies for analytical
workloads. Proceedings of the VLDB Endowment, 2(1), 922–933, 2009.
Aiyer A., Bautin M., Jerry Chen G., Damania P., Khemani P., Muthukkaruppan K.,
Ranganathan K., Spiegelberg N., Tang L., Vaidya M., Storage infrastructure
behind Facebook messages using HBase at scale. Bulletin of the IEEE Computer
Society Technical Committee on Data Engineering, 35(2), 4–13, 2012.
AllegroGraph, http://www.franz.com/agraph/allegrograph/
Amazon AWS, http://aws.amazon.com/
Amazon Dynamo, http://aws.amazon.com/dynamodb/
Antoniu G., Bougè L., Thirion B., Poline J.B., AzureBrain: Large-scale Joint Genetic
and Neuroimaging Data Analysis on Azure Clouds, Microsoft Research Inria Joint
Centre, Palaiseau, France, September 2010. http://www.irisa.fr/kerdata/lib/
exe/fetch.php?media=pdf:inria-microsoft.pdf
Apache Cassandra, http://cassandra.apache.org/
Apache HBase, http://hbase.apache.org/
Apache Hive, http://hive.apache.org/
Apache Solr, http://lucene.apache.org/solr/
Baumann P., Dehmel A., Furtado P., Ritsch R., The multidimensional database sys-
tem RasDaMan. SIGMOD’98 Proceedings of the 1998 ACM SIGMOD International
Conference on Management of Data, Seattle, Washington, pp. 575–577, 1998,
ISBN: 0-89791-995-5.
Bellini P., Cenni D., Nesi P., On the effectiveness and optimization of information
retrieval for cross media content, Proceedings of the KDIR 2012 is Part of IC3K 2012,
International Joint Conference on Knowledge Discovery, Knowledge Engineering
and Knowledge Management, Barcelona, Spain, 2012a.
Bellini P., Bruno, I., Cenni, D., Fuzier, A., Nesi, P., Paolucci, M., Mobile medicine:
Semantic computing management for health care applications on desktop and
mobile devices. Multimedia Tools and Applications, Springer, 58(1), 41–79, 2012b.
DOI 10.1007/s11042-010-0684-y. http://link.springer.com/article/10.1007/s11042-010-0684-y
Bellini P., Bruno I., Cenni D., Nesi P., Micro grids for scalable media computing and
intelligence in distributed scenarios. IEEE MultiMedia, 19(2), 69–79, 2012c.
Bellini P., Bruno I., Nesi P., Rogai D., Architectural solution for interoperable content
and DRM on multichannel distribution, Proc. of the International Conference on
Distributed Multimedia Systems, DMS 2007, Organised by Knowledge Systems
Institute, San Francisco Bay, USA, 2007.
Bellini P., Nesi P., Pazzaglia F., Exploiting P2P scalability for grant authorization digi-
tal rights management solutions, Multimedia Tools and Applications, Multimedia
Tools and Applications Journal, Springer, April 2013.
Ben-Yitzhak O., Golbandi N., Har’El N., Lempel R., Neumann A., Ofek-Koifman S.,
Sheinwald D., Shekita E., Sznajder B., Yogev S., Beyond basic faceted search,
Proc. of the 2008 International Conference on Web Search and Data Mining, pp.
33–44, 2008.
BlueGene IBM project, http://www.research.ibm.com/bluegene/index.html
Borthakur D., Muthukkaruppan K., Ranganathan K., Rash S., Sen Sarma J., Spiegelberg N., Molkov D. et al., Apache Hadoop goes realtime at Facebook, Proceedings of the 2011 International Conference on Management of Data, Athens, Greece, 2011.
Bose I., Mahapatra R.K., Business data mining—a machine learning perspective.
Information & Management, 39(3), 211–225, 2001.
Brewer E., CAP twelve years later: How the rules have changed. IEEE Computer, 45(2),
23–29, 2012.
Brewer E., Lesson from giant-scale services. IEEE Internet Computing, 5(4), 46–55,
2001.
Bryant R., Katz R.H., Lazowska E.D., Big-data computing: Creating revolution-
ary breakthroughs in commerce, science and society, In Computing Research
Initiatives for the 21st Century, Computing Research Association, Ver.8, 2008. http://
www.cra.org/ccc/files/docs/init/Big_Data.pdf
Bryant R.E., Carbonell J.G., Mitchell T., From data to knowledge to action: Enabling
advanced intelligence and decision-making for America’s security, Computing
Community Consortium, Version 6, July 28, 2010.
Cao L., Wang Y., Xiong J., Building highly available cluster file system based on
replication, International Conference on Parallel and Distributed Computing,
Applications and Technologies, Higashi Hiroshima, Japan, pp. 94–101, December
2009.
Cattell R., Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12–27, 2010.
Couchbase, http://www.couchbase.com/
Cunningham H., Maynard D., Bontcheva K., Tablan V., GATE: A framework and
graphical development environment for robust NLP tools and applications,
Proceedings of the 40th Anniversary Meeting of the Association for Computational
Linguistics, Philadelphia, July 2002.
De Witt D.J., Paulson E., Robinson E., Naugton J., Royalty J., Shankar S., Krioukov A.,
Clustera: an integrated computation and data management system. Proceedings
of the VLDB Endowment, 1(1), 28–41, 2008.
De Witt S., Sinclair R., Sansum A., Wilson M., Managing large data volumes from
scientific facilities. ERCIM News 89, 15, 2012.
Domingos P., Mining social networks for viral marketing. IEEE Intelligent Systems,
20(1), 80–82, 2005.
Dykstra D., Comparison of the frontier distributed database caching system to NoSQL
databases, Computing in High Energy and Nuclear Physics (CHEP) Conference,
New York, May 2012.
Eaton C., Deroos D., Deutsch T., Lapis G., Understanding Big Data: Analytics for
Enterprise Class Hadoop and Streaming Data, McGraw Hill Professional, McGraw
Hill, New York, 2012, ISBN: 978-0071790536.
ECLAP, http://www.eclap.eu
Europeana Portal, http://www.europeana.eu/portal/
Everitt B., Landau S., Leese M., Cluster Analysis, 4th edition, Arnold, London, 2001.
Figueireido V., Rodrigues F., Vale Z., An electric energy consumer characteriza-
tion framework based on data mining techniques. IEEE Transactions on Power
Systems, 20(2), 596–602, 2005.
Foster I., Jeffrey M., and Tuecke S. Grid services for distributed system integration,
IEEE Computer, 5(6), 37–46, 2002.
Fox A., Brewer E.A., Harvest, yield, and scalable tolerant systems, Proceedings of the
Seventh Workshop on Hot Topics in Operating Systems, Rio Rico, Arizona, pp. 174–
178, 1999.
Gallego M.A., Fernandez J.D., Martinez-Prieto M.A., De La Fuente P., RDF visual-
ization using a three-dimensional adjacency matrix, 4th International Semantic
Search Workshop (SemSearch), Hyderabad, India, 2011.
Ghosh D., Sharman R., Rao H.R., Upadhyaya S., Self-healing systems—Survey and
synthesis, Decision Support Systems, 42(4), 2164–2185, 2007.
Google Drive, http://drive.google.com
GraphBase, http://graphbase.net/
Gulisano V., Jimenez-Peris R., Patino-Martinez M., Soriente C., Valduriez P., A big
data platform for large scale event processing, ERCIM News, 89, 32–33, 2012.
Hadoop Apache Project, http://hadoop.apache.org/
Hanna M., Data mining in the e-learning domain. Campus-Wide Information Systems,
21(1), 29–34, 2004.
Iaconesi S., Persico O., The co-creation of the city, re-programming cities using
real-time user generated content, 1st Conference on Information Technologies for
Performing Arts, Media Access and Entertainment, Florence, Italy, 2012.
Iannella R., Open digital rights language (ODRL), Version 1.1 W3C Note, 2002,
http://www.w3.org/TR/odrl
Jacobs A., The pathologies of big data. Communications of the ACM—A Blind Person’s
Interaction with Technology, 52(8), 36–44, 2009.
Jagadish H.V., Narayan P.P.S., Seshadri S., Kanneganti R., Sudarshan S., Incremental
organization for data recording and warehousing, Proceedings of the 23rd
International Conference on Very Large Data Bases, Athens, Greece, pp. 16–25,
1997.
Karloff H., Suri S., Vassilvitskii S., A model of computation for MapReduce. Proceedings
of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp.
938–948, 2010.
LHC, http://public.web.cern.ch/public/en/LHC/LHC-en.html
Liu L., Biderman A., Ratti C., Urban mobility landscape: Real time monitoring of urban
mobility patterns, Proceedings of the 11th International Conference on Computers in
Urban Planning and Urban Management, Hong Kong, June 2009.
Mans R.S., Schonenberg M.H., Song M., Van der Aalst W.M.P., Bakker P.J.M.,
Application of process mining in healthcare—A case study in a Dutch hospital.
Biomedical Engineering Systems and Technologies, Communications in Computer and
Information Science, 25(4), 425–438, 2009.
McHugh J., Widom J., Abiteboul S., Luo Q., Rajaraman A., Indexing semistructured
data, Technical report, Stanford University, California, 1998.
Meier W., eXist: An open source native XML database. Web, Web-Services, and Database
Systems—Lecture Notes in Computer Science, 2593, 169–183, 2003.
Memcached, http://memcached.org/
Microsoft Azure, http://www.windowsazure.com/it-it/
Microsoft, Microsoft private cloud. Tech. rep., 2012.
Mislove A., Gummandi K.P., Druschel P., Exploiting social networks for Internet
search, Record of the Fifth Workshop on Hot Topics in Networks: HotNets V, Irvine,
CA, pp. 79–84, November 2006.
MongoDB, http://www.mongodb.org/
MPEG-21, http://mpeg.chiariglione.org/standards/mpeg-21/mpeg-21.htm
Neo4J, http://neo4j.org/
Norrie M.C., Grossniklaus M., Decurins C., Semantic data management for db4o,
Proceedings of 1st International Conference on Object Databases (ICOODB 2008),
Frankfurt/Main, Germany, pp. 21–38, 2008.
NoSQL DB, http://nosql-database.org/
Obenshain M.K., Application of data mining techniques to healthcare data, Infection
Control and Hospital Epidemiology, 25(8), 690–695, 2004.
Objectivity Platform, http://www.objectivity.com
Olston C., Jiang J., Widom J., Adaptive filters for continuous queries over distributed
data streams, Proceedings of the 2003 ACM SIGMOD International Conference on
Management of Data, pp. 563–574, 2003.
OpenCirrus Project, https://opencirrus.org/
OpenQM Database, http://www.openqm.org/docs/
OpenStack Project, http://www.openstack.org
Oracle Berkeley, http://www.oracle.com/technetwork/products/berkeleydb/
Pierre G., El Helw I., Stratan C., Oprescu A., Kielmann T., Schuett T., Stender J.,
Artac M., Cernivec A., ConPaaS: An integrated runtime environment for elastic
cloud applications, ACM/IFIP/USENIX 12th International Middleware Conference,
Lisboa, Portugal, December 2011.
RDF-HDT Library, http://www.rdfhdt.org
Rivals E., Philippe N., Salson M., Léonard M., Commes T., Lecroq T., A scalable
indexing solution to mine huge genomic sequence collections. ERCIM News,
89, 20–21, 2012.
Rusitschka S., Eger K., Gerdes C., Smart grid data cloud: A model for utilizing cloud
computing in the smart grid domain, 1st IEEE International Conference of Smart
Grid Communications, Gaithersburg, MD, 2010.
Setnes M., Kaymak U., Fuzzy modeling of client preference from large data sets: An
application to target selection in direct marketing. IEEE Transactions on Fuzzy
Systems, 9(1), February 2001.
SCAPE Project, http://scape-project.eu/
Schramm M., Performance of RDF representations, 16th TSConIT, 2012.
Silvestri Ludovico (LENS), Alessandro Bria (UCBM), Leonardo Sacconi (LENS), Anna Letizia Allegra Mascaro (LENS), Maria Chiara Pettenati (ICON), Sanzio Bassini (CINECA), Carlo Cavazzoni (CINECA), Giovanni Erbacci (CINECA), Roberta Turra (CINECA), Giuseppe Fiameni (CINECA), Valeria Ruggiero (UNIFE), Paolo Frasconi (DSI-UNIFI), Simone Marinai (DSI-UNIFI), Marco Gori (DiSI-UNISI), Paolo Nesi (DSI-UNIFI), Renato Corradetti (Neuroscience-UNIFI), Giulio Iannello (UCBM), Francesco Saverio Pavone (ICON, LENS), Projectome: Set up and testing of a high performance computational infrastructure for processing and visualizing neuro-anatomical information obtained using confocal ultra-microscopy techniques, Neuroinformatics 2012 5th INCF Congress, Munich, Germany, September 2012.
Snell A., Solving big data problems with private cloud storage, White paper, October
2011.
SPARQL at W3C, http://www.w3.org/TR/rdf-sparql-query/
Thelwall M., A web crawler design for data mining. Journal of Information Science,
27(1), 319–325, 2001.
Thusoo A., Sarma J.S., Jain N., Zheng Shao, Chakka P., Ning Zhang, Antony S., Hao
Liu, Murthy R., Hive—A petabyte scale data warehouse using Hadoop, IEEE
26th International Conference on Data Engineering (ICDE), pp. 996–1005, Long
Beach, CA, March 2010.
UniVerse, http://u2.rocketsoftware.com/products/u2-universe
Valduriez P., Pacitti E., Data management in large-scale P2P systems. High
Performance Computing for Computational Science Vecpar 2004—Lecture Notes
in Computer Science, 3402, 104–118, 2005.
Woolf B.P., Baker R., Gianchandani E.P., From Data to Knowledge to Action: Enabling
Personalized Education. Computing Community Consortium, Version 9,
Computing Research Association, Washington DC, 2 September 2010. http://
www.cra.org/ccc/files/docs/init/Enabling_Personalized_Education.pdf
Xu R., Wunsch D.C. II, Clustering, John Wiley and Sons, USA, 2009.
Yang H., Dasdan A., Hsiao R.L., Parker D.S., Map-reduce-merge: simplified relational
data processing on large clusters, SIGMOD’07 Proceedings of the 2007 ACM
SIGMOD International Conference on Management of Data, Beijing, China, pp.
1029–1040, June 2007, ISBN: 978-1-59593-686-8.
Yang J., Tang D., Zhou Y., A distributed storage model for EHR based on HBase,
International Conference on Information Management, Innovation Management and
Industrial Engineering, Shenzhen, China, 2011.
Zenith, http://www-sop.inria.fr/teams/zenith/
Zhang Y., Kersten M., Ivanova M., Pirk H., Manegold S., An implementation of ad-
hoc array queries on top of MonetDB, TELEIOS FP7-257662 Deliverable D5.1,
February 2012.
Zinterhof P., Computer-aided diagnostics. ERCIM News, 89, 46, 2012.
3
Big Data: Challenges and Opportunities

Roberto V. Zicari

Contents
Introduction.......................................................................................................... 104
The Story as it is Told from the Business Perspective..................................... 104
The Story as it is Told from the Technology Perspective................................ 107
Data Challenges............................................................................................... 107
Volume......................................................................................................... 107
Variety, Combining Multiple Data Sets................................................... 108
Velocity......................................................................................................... 108
Veracity, Data Quality, Data Availability................................................. 109
Data Discovery............................................................................................ 109
Quality and Relevance............................................................................... 109
Data Comprehensiveness.......................................................................... 109
Personally Identifiable Information......................................................... 109
Data Dogmatism......................................................................................... 110
Scalability..................................................................................................... 110
Process Challenges.......................................................................................... 110
Management Challenges................................................................................ 110
Big Data Platforms Technology: Current State of the Art......................... 111
Take the Analysis to the Data!.................................................................. 111
What Is Apache Hadoop?.......................................................................... 111
Who Are the Hadoop Users?.................................................................... 112
An Example of an Advanced User: Amazon.......................................... 113
Big Data in Data Warehouse or in Hadoop?........................................... 113
Big Data in the Database World (Early 1980s Till Now)....................... 113
Big Data in the Systems World (Late 1990s Till Now)........................... 113
Enterprise Search........................................................................................ 115
Big Data “Dichotomy”............................................................................... 115
Hadoop and the Cloud.............................................................................. 116
Hadoop Pros................................................................................................ 116
Hadoop Cons.............................................................................................. 116
Technological Solutions for Big Data Analytics.......................................... 118
Scalability and Performance at eBay....................................................... 122
Unstructured Data...................................................................................... 123
Cloud Computing and Open Source....................................................... 123

104 Big Data Computing

Big Data Myth.................................................................................................. 123


Main Research Challenges and Business Challenges................................ 123
Big Data for the Common Good........................................................................ 124
World Economic Forum, the United Nations Global Pulse Initiative..... 124
What Are the Main Difficulties, Barriers Hindering Our
Community to Work on Social Capital Projects?........................................ 125
What Could We Do to Help Supporting Initiatives for Big Data
for Good?.......................................................................................................... 126
Conclusions: The Search for Meaning Behind Our Activities....................... 127
Acknowledgments............................................................................................... 128
References.............................................................................................................. 128

Introduction
“Big Data is the new gold” (Open Data Initiative)

Every day, 2.5 quintillion bytes of data are created. These data come from
digital pictures, videos, posts to social media sites, intelligent sensors, pur-
chase transaction records, cell phone GPS signals, to name a few. This is
known as Big Data.
There is no doubt that Big Data and especially what we do with it has the
potential to become a driving force for innovation and value creation. In
this chapter, we will look at Big Data from three different perspectives: the
business perspective, the technological perspective, and the social good
perspective.

The Story as it is Told from the Business Perspective


Now let us define the term Big Data. I have selected a definition, given by
McKinsey Global Institute (MGI) [1]:

“Big Data” refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.

This definition is quite general and open ended, and well captures the rapid
growth of available data, and also shows the need for technology to “catch up”
with it. Note that the definition is deliberately not stated in terms of a specific
data size; data sets will only keep growing. The threshold also obviously varies by
sector, ranging from a few dozen terabytes to multiple petabytes (1 petabyte is
1000 terabytes).
Big Data 105

(Big) Data is in every industry and business function and is an important
factor of production. MGI estimated that enterprises globally stored 7 exabytes
of new data in 2010. Interestingly, more than 50% of IP traffic is nonhuman,
and machine-to-machine (M2M) communication will become increasingly
important. So what is Big Data
supposed to create? Value. But what “value” exactly? Big Data per se does not
produce any value.
David Gorbet of MarkLogic explains [2]: “the increase in data complexity
is the biggest challenge that every IT department and CIO must address.
Businesses across industries have to not only store the data but also be able
to leverage it quickly and effectively to derive business value.”
Value comes only from what we infer from it. That is why we need Big Data
Analytics.
Werner Vogels, CTO of Amazon.com, describes Big Data Analytics as fol-
lows [3]: “in the old world of data analysis you knew exactly which questions
you wanted to ask, which drove a very predictable collection and storage
model. In the new world of data analysis your questions are going to evolve
and change over time and as such you need to be able to collect, store and
analyze data without being constrained by resources.”
According to MGI, the “value” that can be derived by analyzing Big Data
can be spelled out as follows:

• Creating transparencies;
• Discovering needs, exposing variability, and improving performance;
• Segmenting customers;
• Replacing/supporting human decision-making with automated
algorithms; and
• Innovating new business models, products, and services.

“The most impactful Big Data Applications will be industry- or even
organization-specific, leveraging the data that the organization consumes
and generates in the course of doing business. There is no single set for-
mula for extracting value from this data; it will depend on the application,”
explains David Gorbet.
“There are many applications where simply being able to comb through
large volumes of complex data from multiple sources via interactive que-
ries can give organizations new insights about their products, customers,
services, etc. Being able to combine these interactive data explorations with
some analytics and visualization can produce new insights that would oth-
erwise be hidden. We call this Big Data Search” says David Gorbet.
Gorbet’s concept of “Big Data Search” implies the following:

• There is no single set formula for extracting value from Big Data; it
will depend on the application.
• There are many applications where simply being able to comb
through large volumes of complex data from multiple sources via
interactive queries can give organizations new insights about their
products, customers, services, etc.
• Being able to combine these interactive data explorations with some
analytics and visualization can produce new insights that would
otherwise be hidden.

Gorbet gives an example of the result of such Big Data Search: “it was anal-
ysis of social media that revealed that Gatorade is closely associated with flu
and fever, and our ability to drill seamlessly from high-level aggregate data
into the actual source social media posts shows that many people actually
take Gatorade to treat flu symptoms. Geographic visualization shows that
this phenomenon may be regional. Our ability to sift through all this data
in real time, using fresh data gathered from multiple sources, both internal
and external to the organization helps our customers identify new actionable
insights.”
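The aggregate-then-drill-down pattern Gorbet describes can be sketched in a few lines of plain Python over a handful of hypothetical posts (the field names and data below are invented for illustration; this is not MarkLogic's API):

```python
from collections import Counter

# Hypothetical social-media posts; field names and content are invented.
posts = [
    {"region": "South", "text": "Taking Gatorade for my flu symptoms"},
    {"region": "South", "text": "Fever all week, living on Gatorade"},
    {"region": "North", "text": "Gatorade after the marathon"},
]

def aggregate_by_region(posts, keyword):
    """High-level view: count posts mentioning the keyword, per region."""
    return Counter(p["region"] for p in posts if keyword in p["text"].lower())

def drill_down(posts, region, keyword):
    """Drill from the aggregate back to the raw source posts."""
    return [p["text"] for p in posts
            if p["region"] == region and keyword in p["text"].lower()]

print(aggregate_by_region(posts, "gatorade"))  # regional aggregate
print(drill_down(posts, "South", "gatorade"))  # the underlying posts
```

A real system runs this over billions of documents with indexes, but the two views are the same: the aggregate, and the raw source posts behind it.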
Where will Big Data be used? According to MGI, Big Data can generate finan-
cial value across sectors. They identified the following key sectors:

• Health care (this is a very sensitive area, since patient records and, in
general, information related to health are very critical)
• Public sector administration (e.g., in Europe, the Open Data
Initiative—a European Commission initiative which aims at open-
ing up Public Sector Information)
• Global personal location data (this is very relevant given the rise of
mobile devices)
• Retail (this is the most obvious, given the existence of large Web
retail shops such as eBay and Amazon)
• Manufacturing

I would add to the list two additional areas

• Social personal/professional data (e.g., Facebook, Twitter, and the like)

What are examples of Big Data Use Cases? The following is a sample list:

• Log analytics
• Fraud detection
• Social media and sentiment analysis
• Risk modeling and management
• Energy sector

Currently, the key limitations in exploiting Big Data, according to MGI, are

• Shortage of talent necessary for organizations to take advantage of
Big Data
• Shortage of knowledge in statistics, machine learning, and data
mining

Both limitations reflect the fact that the current underlying technology is
quite difficult to use and understand. As with every new technology, Big Data
Analytics technology will take time to reach a level of maturity
and ease of use for enterprises at large. All the above-mentioned
examples of values generated by analyzing Big Data, however, do not take
into account the possibility that such derived “values” are negative.
In fact, the analysis of Big Data, if improperly used, also poses issues, specifi-
cally in the following areas:

• Access to data
• Data policies
• Industry structure
• Technology and techniques

This is outside the scope of this chapter, but it is certainly one of the most
important nontechnical challenges that Big Data poses.

The Story as it is Told from the Technology Perspective


The above are the business “promises” about Big Data. But what is the reality
today? Big data problems have several characteristics that make them techni-
cally challenging.
We can group the challenges of dealing with Big Data into three dimen-
sions: data, process, and management. Let us look at each of them in some detail:

Data Challenges
Volume
The volume of data, especially machine-generated data, is exploding,
and it grows faster every year as new sources of data emerge. For
example, in the year 2000, 800,000 petabytes (PB) of data were
stored in the world, and the total is expected to reach 35
zettabytes (ZB) by 2020 (according to IBM).

Social media plays a key role: Twitter generates 7+ terabytes (TB) of
data every day; Facebook, 10 TB. Mobile devices play a key role as
well: there were an estimated 6 billion mobile phones in 2011.

The challenge is how to deal with the size of Big Data.

Variety, Combining Multiple Data Sets


More than 80% of today’s information is unstructured, and it is typically too
big to manage effectively. What does this mean?
David Gorbet explains [2]:

It used to be the case that all the data an organization needed to run
its operations effectively was structured data that was generated
within the organization. Things like customer transaction data,
ERP data, etc. Today, companies are looking to leverage a lot more
data from a wider variety of sources both inside and outside the
organization. Things like documents, contracts, machine data, sen-
sor data, social media, health records, emails, etc. The list is endless
really.
A lot of this data is unstructured, or has a complex structure that’s
hard to represent in rows and columns. And organizations want to
be able to combine all this data and analyze it together in new ways.
For example, we have more than one customer in different industries
whose applications combine geospatial vessel location data with
weather and news data to make real-time mission-critical decisions.
Data come from sensors, smart devices, and social collaboration tech-
nologies. Data are not only structured, but raw, semistructured,
unstructured data from web pages, web log files (click stream data),
search indexes, e-mails, documents, sensor data, etc.
Semistructured Web data use cases such as A/B testing, sessioniza-
tion, bot detection, and pathing analysis all require powerful
analytics on many petabytes of data.

The challenge is how to handle multiplicity of types, sources, and formats.

Velocity

Shilpa Lawande of Vertica defines this challenge nicely [4]: “as busi-
nesses get more value out of analytics, it creates a success problem—
they want the data available faster, or in other words, want real-time
analytics.
And they want more people to have access to it, or in other words, high
user volumes.”

One of the key challenges is how to react to the flood of information in the
time required by the application.

Veracity, Data Quality, Data Availability


Who told you that the data you analyzed is good or complete? Paul Miller
[5] mentions that “a good process will, typically, make bad decisions if based
upon bad data. E.g. what are the implications in, for example, a Tsunami that
affects several Pacific Rim countries? If data is of high quality in one country,
and poorer in another, does the Aid response skew ‘unfairly’ toward the
well-surveyed country or toward the educated guesses being made for the
poorly surveyed one?”
There are several challenges:

How can we cope with uncertainty, imprecision, missing values, mis-
statements or untruths?
How good is the data? How broad is the coverage?
How fine is the sampling resolution? How timely are the readings?
How well understood are the sampling biases?
Is there data available, at all?
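A first concrete step toward answering such questions is to audit coverage before any analysis: what fraction of records actually carries a usable value per field? A minimal sketch (the record layout below is invented for illustration):

```python
def audit(records, fields):
    """Report, per field, the fraction of records with a usable value."""
    total = len(records)
    report = {}
    for f in fields:
        present = sum(1 for r in records if r.get(f) not in (None, ""))
        report[f] = present / total if total else 0.0
    return report

# Hypothetical survey readings with gaps in coverage.
readings = [
    {"country": "A", "magnitude": 7.1, "timestamp": "2011-03-11"},
    {"country": "B", "magnitude": None, "timestamp": "2011-03-11"},
    {"country": "B", "magnitude": 6.4, "timestamp": ""},
]
print(audit(readings, ["country", "magnitude", "timestamp"]))
```

A report like this does not fix bad data, but it makes the sampling gaps visible before conclusions are drawn from them.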

Data Discovery
This is a huge challenge: how to find high-quality data from the vast collec-
tions of data that are out there on the Web.

Quality and Relevance


The challenge is determining the quality of data sets and their relevance to par-
ticular issues (i.e., whether the data set makes some underlying assumption that
renders it biased or not informative for a particular question).

Data Comprehensiveness
Are there areas without coverage? What are the implications?

Personally Identifiable Information


Much of this information is about people. Partly, this calls for effective indus-
trial practices. “Partly, it calls for effective oversight by Government. Partly—
perhaps mostly—it requires a realistic reconsideration of what privacy really
means”. (Paul Miller [5])
Can we extract enough information to help people without extracting so
much as to compromise their privacy?

Data Dogmatism
Analysis of Big Data can offer quite remarkable insights, but we must be
wary of becoming too beholden to the numbers. Domain experts—and com-
mon sense—must continue to play a role.
For example, “It would be worrying if the healthcare sector only responded
to flu outbreaks when Google Flu Trends told them to.” (Paul Miller [5])

Scalability
Shilpa Lawande explains [4]: “techniques like social graph analysis, for
instance leveraging the influencers in a social network to create better
user experience are hard problems to solve at scale. All of these problems
combined create a perfect storm of challenges and opportunities to create
faster, cheaper and better solutions for Big Data analytics than traditional
approaches can solve.”

Process Challenges
“It can take significant exploration to find the right model for analysis, and
the ability to iterate very quickly and ‘fail fast’ through many (possibly throw-
away) models—at scale—is critical.” (Shilpa Lawande)
According to Laura Haas (IBM Research), process challenges with deriv-
ing insights include [5]:

• Capturing data
• Aligning data from different sources (e.g., resolving when two
objects are the same)
• Transforming the data into a form suitable for analysis
• Modeling it, whether mathematically, or through some form of
simulation
• Understanding the output, visualizing and sharing the results;
think for a second how to display complex analytics on an iPhone or
a mobile device
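The alignment step — resolving when two records from different sources denote the same object — is often approximated by matching on a normalized key. A toy sketch (the normalization rules and data are illustrative only; real entity resolution adds fuzzy matching, blocking, and human review):

```python
import re

def normalize(name):
    """Crude canonical key: lowercase, strip punctuation and legal suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(inc|ltd|corp)\b", "", key)
    return " ".join(key.split())

def align(source_a, source_b):
    """Pair records from two sources that share a normalized key."""
    index = {normalize(r["name"]): r for r in source_a}
    return [(index[k], r) for r in source_b
            if (k := normalize(r["name"])) in index]

crm = [{"name": "Acme Corp."}, {"name": "Globex Inc"}]
billing = [{"name": "ACME corp"}, {"name": "Initech"}]
print(align(crm, billing))  # "Acme Corp." and "ACME corp" resolve to one entity
```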

Management Challenges
“Many data warehouses contain sensitive data such as personal data. There
are legal and ethical concerns with accessing such data.
So the data must be secured and access controlled as well as logged for
audits.” (Michael Blaha)
The main management challenges are

• Data privacy
• Security
• Governance
• Ethical

The challenges are: Ensuring that data are used correctly (abiding by its
intended uses and relevant laws), tracking how the data are used, trans-
formed, derived, etc., and managing its lifecycle.

Big Data Platforms Technology: Current State of the Art


The industry is still in an immature state, experiencing an explosion of dif-
ferent technological solutions. Many of the technologies are far from robust
or enterprise ready, often requiring significant technical skills to support
the software even before analysis is attempted. At the same time, there is
a clear shortage of analytical experience to take advantage of the new data.
Nevertheless, the potential value is becoming increasingly clear.
In the past years, the motto was “rethinking the architecture”: scale and
performance requirements strain conventional databases.

“The problems are a matter of the underlying architecture. If not built


for scale from the ground-up a database will ultimately hit the wall—
this is what makes it so difficult for the established vendors to play in
this space because you cannot simply retrofit a 20+ year-old architecture
to become a distributed MPP database overnight,” says Florian Waas of
EMC/Greenplum [6].
  “In the Big Data era the old paradigm of shipping data to the applica-
tion isn’t working any more. Rather, the application logic must ‘come’ to
the data or else things will break: this is counter to conventional wisdom
and the established notion of strata within the database stack. With tera-
bytes, things are actually pretty simple—most conventional databases
scale to terabytes these days. However, try to scale to petabytes and it’s a
whole different ball game.” (Florian Waas)

This confirms Gray’s Laws of Data Engineering, adapted here to Big Data:

Take the Analysis to the Data!


In order to analyze Big Data, the current state of the art is a parallel database
or NoSQL data store, with a Hadoop connector. Hadoop is used for process-
ing the unstructured Big Data. Hadoop is becoming the standard platform for
doing large-scale processing of data in the enterprise. Its rate of growth far
exceeds that of any other “Big Data” processing platform.

What Is Apache Hadoop?


Hadoop provides a new open source platform to analyze and process Big
Data. It was inspired by Google’s MapReduce and Google File System (GFS)
papers. It is really an ecosystem of projects, including:
Higher-level declarative languages for writing queries and data analysis
pipelines, such as:

• Pig (Yahoo!)—PigLatin, a relational-algebra-like language (used in
ca. 60% of Yahoo! MapReduce use cases)
• Hive (Facebook)—HiveQL, also inspired by SQL (used in ca. 90% of
Facebook MapReduce use cases)
• Jaql (IBM)
• Several other modules covering load, transform, dump, and store,
such as Flume, ZooKeeper, HBase, Oozie, Lucene, and Avro

Who Are the Hadoop Users?


A simple classification:

• Advanced users of Hadoop.


They are often PhDs from top universities with high expertise in
analytics, databases, and data mining. They are looking to go
beyond batch uses of Hadoop to support real-time streaming
of content. Product recommendations, ad placements, customer
churn, patient outcome predictions, fraud detection, and senti-
ment analysis are just a few examples that improve with real-
time information.
How many such advanced users currently exist?
“There are only a few Facebook-sized IT organizations that can
have 60 Stanford PhDs on staff to run their Hadoop infrastruc-
ture. The others need it to be easier to develop Hadoop applica-
tions, deploy them and run them in a production environment.”
(John Schroeder [7])
So, not that many apparently.
• New users of Hadoop
They need Hadoop to become easier. Need it to be easier to develop
Hadoop applications, deploy them, and run them in a produc-
tion environment.
Organizations are also looking to expand Hadoop use cases to
include business critical, secure applications that easily integrate
with file-based applications and products.
With mainstream adoption comes the need for tools that do not
require specialized skills and programmers. New Hadoop devel-
opments must be simple for users to operate and to get data in
and out. This includes direct access with standard protocols
using existing tools and applications.
Is there a real need for it? See also Big Data Myth later.

An Example of an Advanced User: Amazon


“We chose Hadoop for several reasons. First, it is the only available frame-
work that could scale to process 100s or even 1000s of terabytes of data and
scale to installations of up to 4000 nodes. Second, Hadoop is open source and
we can innovate on top of the framework and inside it to help our customers
develop more performant applications quicker.
Third, we recognized that Hadoop was gaining substantial popularity
in the industry with multiple customers using Hadoop and many vendors
innovating on top of Hadoop. Three years later we believe we made the right
choice. We also see that existing BI vendors such as Microstrategy are willing
to work with us and integrate their solutions on top of Elastic MapReduce.”
(Werner Vogels, VP and CTO, Amazon [3])

Big Data in Data Warehouse or in Hadoop?


Roughly speaking we have:

• Data warehouse: structured data, data “trusted”


• Hadoop: semistructured and unstructured data. Data “not trusted”

An interesting historical perspective on the development of Big Data comes
from Michael J. Carey [8]. He distinguishes between:

Big Data in the Database World (Early 1980s Till Now)

• Parallel Databases. Shared-nothing architecture, declarative set-
oriented nature of relational queries, divide and conquer parallelism
(e.g., Teradata). Later phase: re-implementation of relational data-
bases (e.g., HP/Vertica, IBM/Netezza, Teradata/Aster Data, EMC/
Greenplum, Hadapt)
  and

Big Data in the Systems World (Late 1990s Till Now)

• Apache Hadoop (inspired by Google GFS, MapReduce), contributed
by large Web companies. For example, Yahoo!, Facebook, Google
BigTable, Amazon Dynamo

The Parallel database software stack (Michael J. Carey) comprises

• SQL → SQL Compiler
• Relational Dataflow Layer (runs the query plans, orchestrates the local
storage managers, delivers partitioned, shared-nothing storage ser-
vices for large relational tables)
• Row/Column Storage Manager (record-oriented: made up of a set of
row-oriented or column-oriented storage managers per machine in a
cluster)

Note: no open-source parallel database exists! SQL is the only way into the
system architecture. Systems are monolithic: one cannot safely cut into them
to access inner functionalities.
The Hadoop software stack comprises (Michael J. Carey):

• HiveQL, PigLatin, Jaql script → HiveQL/Pig/Jaql (high-level
languages)
• Hadoop M/R job → Hadoop MapReduce Dataflow Layer
(for batch analytics, applies map ops to the data in partitions of an
HDFS file, sorts, and redistributes the results based on key values in
the output data, then performs reduce on the groups of output data
items with matching keys from the map phase of the job)
• Get/Put ops → HBase Key-value Store (accessed directly by client
app or via Hadoop for analytics needs)
• Hadoop Distributed File System (byte-oriented file abstraction;
files appear as a very large contiguous and randomly addressable
sequence of bytes)

Note: all tools are open-source! No SQL. Systems are not monolithic: Can
safely cut into them to access inner functionalities.
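The MapReduce dataflow described above — map over partitions, redistribute results by key, then reduce the groups with matching keys — can be illustrated with a single-process word count in Python (a conceptual sketch, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(partition):
    """Map: emit (word, 1) for every word in a partition."""
    for line in partition:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(mapped):
    """Shuffle: redistribute pairs so matching keys end up together."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group of values for a key."""
    return {key: sum(values) for key, values in groups.items()}

# Two "HDFS file splits", each processed independently by a mapper.
partitions = [["big data is big"], ["data is everywhere"]]
mapped = [pair for p in partitions for pair in map_phase(p)]
print(reduce_phase(shuffle(mapped)))
```

In Hadoop the map and reduce phases run in parallel across machines and the shuffle moves data over the network, but the three-phase structure is exactly this.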
A key requirement when handling Big Data is scalability.
Scalability has three aspects

• data volume
• hardware size
• concurrency

What is the trade-off between scaling out and scaling up? What does it mean
in practice for an application domain?
Chris Anderson of CouchDB explains [11]: “scaling up is easier from a soft-
ware perspective. It’s essentially the Moore’s Law approach to scaling—buy
a bigger box. Well, eventually you run out of bigger boxes to buy, and then
you’ve run off the edge of a cliff. You’ve got to pray Moore keeps up.

Scaling out means being able to add independent nodes to a system. This
is the real business case for NoSQL. Instead of being hostage to Moore’s Law,
you can grow as fast as your data. Another advantage to adding independent
nodes is you have more options when it comes to matching your workload.
You have more flexibility when you are running on commodity hardware—
you can run on SSDs or high-compute instances, in the cloud, or inside your
firewall.”
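The "add independent nodes" model Anderson describes is commonly implemented with consistent hashing: keys live on a hash ring, and adding a node relocates only the keys that fall on the new node's arc. A minimal sketch (no virtual nodes or replication, which production stores would add):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: each key goes to the next node clockwise."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        points = [p for p, _ in self.ring]
        i = bisect(points, h) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b"])
bigger = HashRing(["node-a", "node-b", "node-c"])  # scale out by one node

keys = [f"user:{i}" for i in range(1000)]
moved = sum(ring.node_for(k) != bigger.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved")  # only a fraction relocates
```

With a naive `hash(key) % num_nodes` scheme, adding a node would instead reshuffle almost every key — which is why the ring matters for growing "as fast as your data."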

Enterprise Search
Enterprise Search implies being able to search multiple types of data gener-
ated by an enterprise. There are two alternatives: Apache Solr or implement-
ing a proprietary full-text search engine.
There is an ecosystem of open source tools that build on Apache Solr.
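At the heart of engines like Solr (and the Lucene library beneath it) is an inverted index, mapping each term to the documents that contain it. The idea in miniature (a sketch; real engines add tokenization, ranking, and on-disk index formats):

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: ids of docs containing every query term."""
    sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

docs = {1: "quarterly sales report", 2: "sales contract draft", 3: "HR report"}
index = build_index(docs)
print(search(index, "sales report"))
```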

Big Data “Dichotomy”


The prevalent architecture that people use to analyze structured and
unstructured data is a two-system configuration, where Hadoop is used for
processing the unstructured data and a relational database system or an
NoSQL data store is used for the structured data as a front end.
NoSQL data stores were born when developers of very large-scale user-
facing Web sites implemented key-value stores:

• Google BigTable
• Amazon Dynamo
• Apache Hbase (open source BigTable clone)
• Apache Cassandra, Riak (open source Dynamo clones), etc.

There are concerns about performance issues that arise along with the
transfer of large amounts of data between the two systems. The use of con-
nectors could introduce delays and data silos, and increase Total Cost of
Ownership (TCO).
Daniel Abadi of Hadapt says [10]: “this is a highly undesirable architecture,
since now you have two systems to maintain, two systems where data may be
stored, and if you want to do analysis involving data in both systems, you end
up having to send data over the network which can be a major bottleneck.”
Big Data is not (only) Hadoop.
“Some people even think that ‘Hadoop’ and ‘Big Data’ are synonymous
(though this is an over-characterization). Unfortunately, Hadoop was
designed based on a paper by Google in 2004 which was focused on use
cases involving unstructured data (e.g., extracting words and phrases from
Web pages in order to create Google’s Web index). Since it was not origi-
nally designed to leverage the structure in relational data in order to take
short-cuts in query processing, its performance for processing relational data
is therefore suboptimal” says Daniel Abadi of Hadapt.
Duncan Ross of Teradata confirms this: “the biggest technical challenge is
actually the separation of the technology from the business use! Too often
people are making the assumption that Big Data is synonymous with Hadoop,
and any time that technology leads business things become difficult. Part of
this is the difficulty of use that comes with this.
It’s reminiscent of the command line technologies of the 70s—it wasn’t
until the GUI became popular that computing could take off.”

Hadoop and the Cloud


Amazon has a significant web-services business around Hadoop.
But in general, people are concerned with the protection and security of
their data. What about traditional enterprises?
Here is an attempt to list the pros and cons of Hadoop.

Hadoop Pros
• Open source.
• Nonmonolithic support for access to file-based external data.
• Support for automatic and incremental forward-recovery of jobs
with failed tasks.
• Ability to schedule very large jobs in smaller chunks.
• Automatic data placement and rebalancing as data grows and
machines come and go.
• Support for replication and machine fail-over without operator
intervention.
• The combination of scale, ability to process unstructured data along
with the availability of machine learning algorithms, and recom-
mendation engines create the opportunity to build new game chang-
ing applications.
• Does not require a schema first.
• Provides a great tool for exploratory analysis of the data, as long as
you have the software development expertise to write MapReduce
programs.

Hadoop Cons
• Hadoop is difficult to use.
• Can give powerful analysis, but it is fundamentally a batch-oriented
paradigm. The missing piece of the Hadoop puzzle is accounting for
real-time changes.

• The Hadoop file system (HDFS) has a centralized metadata store
(NameNode), which represents a single point of failure without high
availability. When the NameNode is recovered, it can take a long
time to get the Hadoop cluster running again.
• Hadoop assumes that the workloads it runs will be long running, so
it makes heavy use of checkpointing at intermediate stages. This
means parts of a job can fail, be restarted, and eventually complete
successfully—there are no transactional guarantees.

Current Hadoop distributions face the following challenges:

• Getting data in and out of Hadoop. Some Hadoop distributions are
limited by the append-only nature of the Hadoop Distributed File
System (HDFS) that requires programs to batch load and unload data
into a cluster.
• The lack of reliability of current Hadoop software platforms is a
major impediment for expansion.
• Protecting data against application and user errors.
• Hadoop has no backup and restore capabilities. Users have to con-
tend with data loss or resort to very expensive solutions that reside
outside the actual Hadoop cluster.

There is work in progress from vendors of commercial Hadoop
distributions (e.g., MapR) to fix this by reimplementing Hadoop components.
It would be desirable to have seamless integration.

“Instead of stand-alone products for ETL, BI/reporting, and analytics we
have to think about seamless integration: in what ways can we open up
a data processing platform to enable applications to get closer? What lan-
a data processing platform to enable applications to get closer? What lan-
guage interfaces, but also what resource management facilities can we
offer? And so on.” (Florian Waas)

Daniel Abadi: “A lot of people are using Hadoop as a sort of data refinery.
Data starts off unstructured, and Hadoop jobs are run to clean, transform,
and structure the data. Once the data is structured, it is shipped to SQL
databases where it can be subsequently analyzed. This leads to the raw data
being left in Hadoop and the refined data in the SQL databases. But it’s basi-
cally the same data—one is just a cleaned (and potentially aggregated) ver-
sion of the other. Having multiple copies of the data can lead to all kinds of
problems. For example, let’s say you want to update the data in one of the
two locations—it does not get automatically propagated to the copy in the
other silo. Furthermore, let’s say you are doing some analysis in the SQL
database and you see something interesting and want to drill down to the
raw data—if the raw data is located on a different system, such a drill down
becomes highly nontrivial. Furthermore, data provenance is a total night-
mare. It’s just a really ugly architecture to have these two systems with a
connector between them.”
Michael J. Carey adds that it is:

• Questionable to layer a record-oriented data abstraction on top of a
giant globally sequenced byte-stream file abstraction.

(E.g., HDFS is unaware of record boundaries, so fixed-length file splits
produce “broken records,” i.e., a record with some of its bytes in one split
and some in the next.)

• Questionable to build a parallel data runtime on top of a unary
operator model (map, reduce, combine), e.g., performing joins with
MapReduce.
• Questionable to build a key-value store layer with remote query
access at the next layer. Pushing queries down to data is likely to
outperform pulling data up to queries.
• Lack of schema information is flexible today, but a recipe for future
difficulties. E.g., future maintainers of applications will likely have
problems fixing bugs related to changes in or assumptions about the
structure of data files in HDFS. (This was one of the very early les-
sons in the DB world.)
• Not addressing single-system performance, focusing solely on
scale-out.
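The "broken records" point above is easy to demonstrate: cut a byte stream at fixed offsets and records straddle the split boundaries, which is exactly what any record-aware layer on top of HDFS must repair. A small self-contained demonstration:

```python
# A byte-oriented "file" of fixed 9-byte records, with no record boundaries.
records = [b"rec-0001;", b"rec-0002;", b"rec-0003;"]
stream = b"".join(records)

split_size = 10  # fixed-length splits, oblivious to record structure
splits = [stream[i:i + split_size] for i in range(0, len(stream), split_size)]

for n, s in enumerate(splits):
    # A split whose last record is cut mid-way ends without the ';' terminator.
    status = "ok" if s.endswith(b";") else "broken record at split edge"
    print(n, s, status)
```

Hadoop's input formats handle this by letting a reader continue past its split boundary to finish a partial record; the file system itself offers no help.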

Technological Solutions for Big Data Analytics


There are several technological solutions available in the market for Big Data
Analytics. Here are some examples:

A NoSQL Data Store (Couchbase, Riak, Cassandra, MongoDB, etc.) Connected to
Hadoop

With this solution, a NoSQL data store is used as a front end to process
selected data in real time, with Hadoop in the back end process-
ing Big Data in batch mode.

“In my opinion the primary interface will be via the real time store,
and the Hadoop layer will become a commodity. That is why there is
so much competition for the NoSQL brass ring right now” says J. Chris
Anderson of Couchbase (a NoSQL data store).

In some applications, for example, Couchbase (NoSQL) is used to enhance
the batch-based Hadoop analysis with real-time information, giving the
effect of a continuous process. Hot data live in Couchbase in RAM.
Big Data 119

The process consists of essentially moving the data out of Couchbase


into Hadoop when it cools off. CouchDB supplies a connector to Apache
Sqoop (a Top-Level Apache project since March of 2012), a tool designed
for efficiently transferring bulk data between Hadoop and relational
databases.
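The hot/cold split described above can be sketched in a few lines. This is an illustrative toy, not Couchbase's or Sqoop's actual API: a front-end store keeps recently touched keys in memory and drains anything that has cooled off past a TTL into a batch sink that stands in for HDFS:

```python
import time

class HotStore:
    """Toy front-end store: recently touched keys stay 'hot' in memory;
    anything untouched for longer than ttl is drained to a batch sink
    (standing in for a Sqoop-style transfer into Hadoop)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.data = {}                         # key -> (value, last_access)

    def put(self, key, value):
        self.data[key] = (value, time.monotonic())

    def get(self, key):
        value, _ = self.data[key]
        self.data[key] = (value, time.monotonic())   # touching re-warms
        return value

    def drain_cold(self, sink, now=None):
        """Move every entry older than ttl into the batch sink."""
        now = time.monotonic() if now is None else now
        cold = [k for k, (_, ts) in self.data.items() if now - ts > self.ttl]
        for k in cold:
            sink.append((k, self.data.pop(k)[0]))
        return len(cold)

batch_sink = []                                # stands in for HDFS
store = HotStore(ttl_seconds=60)
store.put("session:1", {"clicks": 3})
# Simulate the entry cooling off by pretending 120 s have passed:
moved = store.drain_cold(batch_sink, now=time.monotonic() + 120)
```

A real deployment would of course drain asynchronously and in bulk; the point here is only the division of labor between the real-time tier and the batch tier.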

A NewSQL Data Store for Analytics (HP/Vertica) Instead of Hadoop

Another approach is to use a NewSQL data store designed for Big Data
Analytics, such as HP/Vertica. Quoting Shilpa Lawande [4]: "Vertica was
designed from the ground up for analytics.” Vertica is a columnar database
engine including sorted columnar storage, a query optimizer, and an execu-
tion engine, providing standard ACID transaction semantics on loads and
queries.
With sorted columnar storage, there are two methods that drastically
reduce the I/O bandwidth requirements for such Big Data analytics work-
loads. The first is that Vertica only reads the columns that queries need.
Second, Vertica compresses the data significantly better than anyone else.
Vertica’s execution engine is optimized for modern multicore processors
and we ensure that data stays compressed as much as possible through
the query execution, thereby reducing the CPU cycles to process the query.
Additionally, we have a scale-out MPP architecture, which means you can
add more nodes to Vertica.
All of these elements are extremely critical to handle the data volume chal-
lenge. With Vertica, customers can load several terabytes of data quickly (per
hour in fact) and query their data within minutes of it being loaded—that is
real-time analytics on Big Data for you.
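The two I/O savings Lawande describes (reading only the columns a query needs, and compressing sorted columns) can be illustrated with a toy column store. The data are hypothetical and run-length encoding stands in for Vertica's actual codecs:

```python
def rle_encode(column):
    """Run-length encode a column as [value, run_length] pairs --
    very effective on sorted columns."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Hypothetical fact table stored column-wise, sorted on 'region'.
columns = {
    "region": ["EU", "EU", "EU", "US", "US"],
    "amount": [10, 25, 5, 40, 15],
    "note":   ["a", "b", "c", "d", "e"],
}

# Query: SELECT SUM(amount) WHERE region = 'EU'
# 1) Column pruning: only 'region' and 'amount' are ever read.
region = rle_encode(columns["region"])     # 5 rows collapse to 2 runs
# 2) Because the column is sorted and compressed, the matching row
#    range falls out of the run lengths without scanning rows.
offset = 0
eu_rows = range(0)
for value, length in region:
    if value == "EU":
        eu_rows = range(offset, offset + length)
        break
    offset += length
total = sum(columns["amount"][i] for i in eu_rows)   # 'note' is never touched
```

Both effects compound: fewer columns read, and each column carrying far fewer bytes than a row-oriented layout would.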
There is a myth that columnar databases are slow to load. This may have
been true with older generation column stores, but in Vertica, we have a
hybrid in-memory/disk load architecture that rapidly ingests incoming data
into a write-optimized row store and then converts that to read-optimized
sorted columnar storage in the background. This is entirely transparent to
the user because queries can access data in both locations seamlessly. We
have a very lightweight transaction implementation with snapshot isolation;
queries can always run without any locks.
And we have no auxiliary data structures, like indices or material-
ized views, which need to be maintained postload. Last, but not least, we
designed the system for “always on,” with built-in high availability features.
Operations that translate into downtime in traditional databases are online
in Vertica, including adding or upgrading nodes, adding or modifying data-
base objects, etc. With Vertica, we have removed many of the barriers to mon-
etizing Big Data and hope to continue to do so.
“Vertica and Hadoop are both systems that can store and analyze large
amounts of data on commodity hardware. The main differences are how the
data get in and out, how fast the system can perform, and what transaction
120 Big Data Computing

guarantees are provided. Also, from the standpoint of data access, Vertica’s
interface is SQL and data must be designed and loaded into an SQL schema
for analysis. With Hadoop, data is loaded AS IS into a distributed file sys-
tem and accessed programmatically by writing Map-Reduce programs.”
(Shilpa Lawande [4])

A NewSQL Data Store for OLTP (VoltDB) Connected with Hadoop or a Data
Warehouse

With this solution, a fast NewSQL data store designed for OLTP (VoltDB) is
connected to either a conventional data warehouse or Hadoop.
"We identified 4 sources of significant OLTP overhead (concurrency
control, write-ahead logging, latching, and buffer pool management).
Unless you make a big dent in ALL FOUR of these sources, you will not
run dramatically faster than current disk-based RDBMSs. To the best of my
knowledge, VoltDB is the only system that eliminates or drastically reduces
all four of these overhead components. For example, TimesTen uses conven-
tional record level locking, an Aries-style write ahead log and conventional
multi-threading, leading to substantial need for latching. Hence, they elimi-
nate only one of the four sources.
VoltDB is not focused on analytics. We believe they should be run on a
companion data warehouse. Most of the warehouse customers I talk to want
to keep increasingly large amounts of increasingly diverse history to run their
analytics over. The major data warehouse players are routinely being asked
to manage petabyte-sized data warehouses. VoltDB is intended for the OLTP
portion, and some customers wish to run Hadoop as a data warehouse plat-
form. To facilitate this architecture, VoltDB offers a Hadoop connector.
VoltDB supports standard SQL. Complex joins should be run on a com-
panion data warehouse. After all, the only way to interleave ‘big reads’
with ‘small writes’ in a legacy RDBMS is to use snapshot isolation or run
with a reduced level of consistency. You either get an out-of-date, but con-
sistent answer or an up-to-date, but inconsistent answer. Directing big
reads to a companion DW, gives you the same result as snapshot isolation.
Hence, I do not see any disadvantage to doing big reads on a companion
system.
Concerning larger amounts of data, our experience is that OLTP problems
with more than a few Tbyte of data are quite rare. Hence, these can easily fit
in main memory, using a VoltDB architecture.
In addition, we are planning extensions of the VoltDB architecture to han-
dle larger-than-main-memory data sets.” (Mike Stonebraker [13])

A NewSQL for Analytics (Hadapt) Complementing Hadoop

An alternative solution is to use a NewSQL data store designed for analytics
(Hadapt), which complements Hadoop.

Daniel Abadi explains: "At Hadapt, we're bringing 3 decades of relational
database research to Hadoop. We have added features like indexing, co-
partitioned joins, broadcast joins, and SQL access (with interactive query
response times) to Hadoop, in order to both accelerate its performance for
queries over relational data and also provide an interface that third party
data processing and business intelligence tools are familiar with.
Therefore, we have taken Hadoop, which used to be just a tool for super-
smart data scientists, and brought it to the mainstream by providing a high
performance SQL interface that business analysts and data analysis tools
already know how to use. However, we’ve gone a step further and made it
possible to include both relational data and non-relational data in the same
query; so what we’ve got now is a platform that people can use to do really
new and innovative types of analytics involving both unstructured data like
tweets or blog posts and structured data such as traditional transactional
data that usually sits in relational databases.
What is special about the Hadapt architecture is that we are bringing data-
base technology to Hadoop, so that Hadapt customers only need to deploy a
single cluster—a normal Hadoop cluster—that is optimized for both struc-
tured and unstructured data, and is capable of pushing the envelope on the
type of analytics that can be run over Big Data.” [10]
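The kind of mixed query Abadi describes, relational data and free text in one statement, can be caricatured in a few lines. This is only an illustration of the idea, with hypothetical data; it is not Hadapt's SQL interface or engine:

```python
import re

# "Customers who posted about a product they bought":
# one query over a structured side (transactions) and an
# unstructured side (free-text posts).
transactions = [("ann", "camera"), ("bob", "mug")]
posts = [("ann", "loving my new camera!"), ("bob", "rainy day")]

def mixed_query(transactions, posts):
    results = []
    for user, product in transactions:       # relational predicate: equi-join on user
        for author, text in posts:           # text predicate: keyword match
            if author == user and re.search(re.escape(product), text):
                results.append((user, product))
    return results

hits = mixed_query(transactions, posts)
```

A real engine would index both sides and push the predicates down; the sketch only shows why having both kinds of data on one cluster makes such a query natural to express.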

A Combination of Data Stores: A Parallel Database (Teradata) and Hadoop

An example of this solution is the architecture for Complex Analytics at eBay
(Tom Fastner [12]).
The use of analytics at eBay is rapidly changing, and analytics is driving
many key initiatives like buyer experience, search optimization, buyer
protection, or mobile commerce. eBay is investing heavily in new tech-
nologies and approaches to leverage new data sources to drive innovation.
eBay uses three different platforms for analytics:

1. "EDW": dual systems for transactional (structured) data; Teradata
6690 with 9.5 PB spinning disk and 588 TB SSD.
• The largest mixed-storage Teradata system worldwide; spool, some
dictionary tables, and user data are automatically managed by
access frequency to stay on SSD. 10+ years of experience; very
high concurrency; good accessibility; hundreds of applications.
2. "Singularity": deep Teradata system for semistructured data; 36 PB
spinning disk.
• Lower concurrency than EDW, but can store more data; the biggest
use case is User Behavior Analysis; the largest table is 1.2 PB with
~3 trillion rows.

3. Hadoop: for unstructured/complex data; ~40 PB spinning disk;


• Text analytics, machine learning, has the user behavior data and
selected EDW tables; lower concurrency and utilization.

The main technical challenges for Big Data analytics at eBay are

• I/O bandwidth: limited due to configuration of the nodes.


• Concurrency/workload management: Workload management tools usually
manage the limited resource. For many years, EDW systems bottlenecked
on the CPU; big systems are configured with ample CPU, making I/O the
bottleneck. Vendors are starting to put mechanisms in place to manage
I/O, but it will take some time to get to the same level of
sophistication.
• Data movement (loads, initial loads, backup/restores): As new platforms
emerge, you need to make data available on more systems, challenging
networks, movement tools, and support to ensure scalable operations
that maintain data consistency.

Scalability and Performance at eBay

• EDW: models for the unknown (close to third NF) to provide a solid
physical data model suitable for many applications, which limits
the number of physical copies needed to satisfy specific application
requirements.

A lot of scalability and performance is built into the database, but, as with
any shared resource, it does require an excellent operations team to fully
leverage the capabilities of the platform.

• Singularity: The platform is identical to EDW; the only exceptions
are limitations in workload management due to configuration choices.

But since they are leveraging the latest database release, they are exploring
ways to adopt new storage and processing patterns. Some new data sources
are stored in a denormalized form, significantly simplifying data modeling
and ETL. On top, they developed functions to support the analysis of
semistructured data. This also enables more sophisticated algorithms that
would be very hard, inefficient, or impossible to implement in pure SQL.
One example is the pathing of user sessions. However, the size of the data
requires them to focus more on best practices (develop on small subsets, use
a 1% sample; process by day).
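The "use a 1% sample" practice above is usually implemented by hashing a stable key rather than flipping a coin per row, so that a given user is always entirely in, or entirely out of, the sample and sessions stay intact. A minimal sketch with hypothetical user ids:

```python
import hashlib

def in_one_percent_sample(user_id: str) -> bool:
    """Deterministic 1% sample: hash the user id and keep ids whose
    hash falls in one bucket out of 100. The same user is always in
    (or out of) the sample, so sessions and paths stay whole --
    unlike row-level random sampling, which would shred them."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % 100 == 0

sample = [u for u in (f"user{i}" for i in range(10_000))
          if in_one_percent_sample(u)]
rate = len(sample) / 10_000          # close to 0.01 for uniform hashes
```

Because membership is a pure function of the id, the same 1% can be materialized consistently across EDW, Singularity, and Hadoop.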

• Hadoop: The emphasis on Hadoop is on optimizing for access. The
reusability of data structures (besides "raw" data) is very low.

Unstructured Data
Unstructured data are handled on Hadoop only. The data are copied from
the source systems into HDFS for further processing. They do not store any
of that on the Singularity (Teradata) system.
Use of Data management technologies:

• ETL: AbInitio, home-grown parallel Ingest system


• Scheduling: UC4
• Repositories: Teradata EDW; Teradata Deep system; Hadoop
• BI: Microstrategy, SAS, Tableau, Excel
• Data Modeling: Power Designer
• Ad hoc: Teradata SQL Assistant; Hadoop Pig and Hive
• Content Management: Joomla-based

Cloud Computing and Open Source


“We do leverage internal cloud functions for Hadoop; no cloud for Teradata.
Open source: committers for Hadoop and Joomla; strong commitment to
improve those technologies.” (Tom Fastner, Principal Architect at eBay)

Big Data Myth


It is interesting to report here what Marc Geall, a research analyst at
Deutsche Bank AG in London, writes about the "Big Data Myth," and pre-
dicts [9]:
“We believe that in-memory/NewSQL is likely to be the prevalent data-
base model rather than NoSQL due to three key reasons:

1. The limited need for petabyte-scale data today, even among the
NoSQL deployment base;
2. The very low proportion of databases in corporate deployment that
require more than tens of TB of data to be handled;
3. The lack of availability and high cost of highly skilled operators
(often post-doctoral) to operate highly scalable NoSQL clusters."

Time will tell us whether this prediction is accurate or not.

Main Research Challenges and Business Challenges


We conclude this part of the chapter by looking at three elements: data, plat-
form, and analysis with two quotes:
Werner Vogels: “I think that sharing is another important aspect to the
mix. Collaborating during the whole process of collecting data, storing
it, organizing it and analyzing it is essential. Whether it’s scientists in a


research field or doctors at different hospitals collaborating on drug tri-
als, they can use the cloud to easily share results and work on common
datasets.”
Daniel Abadi: “Here are a few that I think are interesting:

1. Scalability of non-SQL analytics. How do you parallelize clustering,
classification, statistical, and algebraic functions that are not 'embar-
rassingly parallel' (that have traditionally been performed on a sin-
gle server in main memory) over a large cluster of shared-nothing
servers?
2. Reducing the cognitive complexity of ‘Big Data’ so that it can fit in
the working set of the brain of a single analyst who is wrangling
with the data.
3. Incorporating graph data sets and graph algorithms into database
management systems.
4. Enabling platform support for probabilistic data and probabilistic
query processing.’’

Big Data for the Common Good


"As more data become less costly and technology breaks barriers to acquisi-
tion and analysis, the opportunity to deliver actionable information for civic
purposes grows. This might be termed the 'common good' challenge for Big
Data." (Jake Porway, DataKind)
Very few people seem to look at how Big Data can be used for solving
social problems. Most of the work in fact is not in this direction.
Why is this? What can be done in the international research/development
community to make sure that some of the most brilliant ideas do have an
impact also for social issues?
In the following, I will list some relevant initiatives and selected thoughts
for Big Data for the Common Good.

World Economic Forum, the United Nations Global Pulse Initiative


The United Nations Global Pulse initiative is one example. Earlier this year
at the 2012 Annual Meeting in Davos, the World Economic Forum pub-
lished a white paper entitled “Big Data, Big Impact: New Possibilities for
International Development.” The WEF paper lays out several of the ideas
which fundamentally drive the Global Pulse initiative and presents in con-
crete terms the opportunity presented by the explosion of data in our world
today, and how researchers and policy-makers are beginning to realize the
potential for leveraging Big Data to extract insights that can be used for
Good, in particular, for the benefit of low-income populations.
“A flood of data is created every day by the interactions of billions of peo-
ple using computers, GPS devices, cell phones, and medical devices. Many
of these interactions occur through the use of mobile devices being used by
people in the developing world, people whose needs and habits have been
poorly understood until now.
Researchers and policymakers are beginning to realize the potential for
channeling these torrents of data into actionable information that can be
used to identify needs, provide services, and predict and prevent crises for
the benefit of low-income populations. Concerted action is needed by gov-
ernments, development organizations, and companies to ensure that this
data helps the individuals and communities who create it.”
Three examples are cited in WEF paper:

• UN Global Pulse: an innovation initiative of the UN Secretary-General,
harnessing today's new world of digital data and real-time analytics
to gain a better understanding of changes in human well-being
(www.unglobalpulse.org).
• Global Viral Forecasting: a not-for-profit whose mission is to promote
understanding, exploration, and stewardship of the microbial world
(www.gvfi.org).
• Ushahidi SwiftRiver Platform: a non-profit tech company that
specializes in developing free and open source software for
information collection, visualization, and interactive mapping
(http://ushahidi.com).

What Are the Main Difficulties and Barriers Hindering Our Community
from Working on Social Capital Projects?
I have listed below some extracts from [5]:

• Alon Halevy (Google Research): "I don't think there are particular
barriers from a technical perspective. Perhaps the main barrier is
ideas of how to actually take this technology and make social impact.
These ideas typically don’t come from the technical community, so
we need more inspiration from activists.”
• Laura Haas (IBM Research): "Funding and availability of data are
two big issues here. Much funding for social capital projects comes
from governments—and as we know, are but a small fraction of
the overall budget. Further, the market for new tools and so on
that might be created in these spaces is relatively limited, so it is
not always attractive to private companies to invest. While there is a
lot of publicly available data today, often key pieces are missing, or
privately held, or cannot be obtained for legal reasons, such as the


privacy of individuals, or a country’s national interests. While this is
clearly an issue for most medical investigations, it crops up as well
even with such apparently innocent topics as disaster management
(some data about, e.g., coastal structures, may be classified as part of
the national defense).”
• Paul Miller (Consultant): “Perceived lack of easy access to data that’s
unencumbered by legal and privacy issues? The large-scale and long
term nature of most of the problems? It’s not as ‘cool’ as something
else? A perception (whether real or otherwise) that academic fund-
ing opportunities push researchers in other directions? Honestly,
I’m not sure that there are significant insurmountable difficulties or
barriers, if people want to do it enough. As Tim O’Reilly said in 2009
(and many times since), developers should ‘Work on stuff that mat-
ters.’ The same is true of researchers.”
• Roger Barga (Microsoft Research): "The greatest barrier may be social.
Such projects require community awareness to bring people to take
action and often a champion to frame the technical challenges in
a way that is approachable by the community. These projects will
likely require close collaboration between the technical community
and those familiar with the problem.”

What Could We Do to Help Support Initiatives for Big Data for Good?
I have listed below some extracts from [5]:

• Alon Halevy (Google Research): "Building a collection of high-quality
data that is widely available and can serve as the backbone for
many specific data projects. For example, datasets that include
boundaries of countries/counties and other administrative regions,
data sets with up-to-date demographic data. It’s very common that
when a particular data story arises, these data sets serve to enrich it.”
• Laura Haas (IBM Research): “Increasingly, we see consortiums of
institutions banding together to work on some of these problems.
These Centers may provide data and platforms for data-intensive
work, alleviating some of the challenges mentioned above by acquir-
ing and managing data, setting up an environment and tools, bring-
ing in expertise in a given topic, or in data, or in analytics, providing
tools for governance, etc.
  My own group is creating just such a platform, with the goal of
facilitating such collaborative ventures. Of course, lobbying our gov-
ernments for support of such initiatives wouldn’t hurt!”

• Paul Miller (Consultant): “Match domains with a need to research-


ers/companies with a skill/product. Activities such as the recent Big
Data Week Hackathons might be one route to follow—encourage the
organisers (and companies like Kaggle, which do this every day) to
run Hackathons and competitions that are explicitly targeted at a
‘social’ problem of some sort. Continue to encourage the Open Data
release of key public data sets. Talk to the agencies that are working
in areas of interest, and understand the problems that they face. Find
ways to help them do what they already want to do, and build trust
and rapport that way.”
• Roger Barga (Microsoft Research): "Provide tools and resources to
empower the long tail of research. Today, only a fraction of scien-
tists and engineers enjoy regular access to high performance and
data-intensive computing resources to process and analyze massive
amounts of data and run models and simulations quickly. The reality
for most of the scientific community is that speed to discovery is often
hampered, as they have to either queue up for access to limited
resources or pare down the scope of research to accommodate
available processing power. This problem is particularly acute at the
smaller research institutes which represent the long tail of the research
community. Tier 1 and some tier 2 universities have sufficient funding
and infrastructure to secure and support computing resources while
the smaller research programs struggle. Our funding agencies and
corporations must provide resources to support researchers, in par-
ticular those who do not have access to sufficient resources.”

Conclusions: The Search for Meaning Behind Our Activities


I would like to conclude this chapter with the quote below, which I find
inspiring.

“All our activities in our lives can be looked at from different perspec-
tives and within various contexts: our individual view, the view of our
families and friends, the view of our company and finally the view of
society—the view of the world. Which perspective means what to us
is not always clear, and it can also change over the course of time. This
might be one of the reasons why our life sometimes seems unbalanced.
We often talk about work-life balance, but maybe it is rather an imbal-
ance between the amount of energy we invest into different elements of
our life and their meaning to us.”
—Eran Davidson, CEO Hasso Plattner Ventures

Acknowledgments
I would like to thank Michael Blaha, Rick Cattell, Michael Carey, Akmal
Chaudhri, Tom Fastner, Laura Haas, Alon Halevy, Volker Markl, Dave
Thomas, Duncan Ross, Cindy Saracco, Justin Sheehy, Mike O'Sullivan, Martin
Verlage, and Steve Vinoski for their feedback on an earlier draft of this chapter.
But all errors and missing information are mine.

References
1. McKinsey Global Institute (MGI), Big Data: The next frontier for innovation,
competition, and productivity, Report, June, 2012.
2. Managing Big Data. An interview with David Gorbet. ODBMS Industry Watch,
July 2, 2012. http://www.odbms.org/blog/2012/07/managing-big-data-an-
interview-with-david-gorbet/
3. On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.
com. ODBMS Industry Watch, November 2, 2011. http://www.odbms.org/
blog/2011/11/on-big-data-interview-with-dr-werner-vogels-cto-and-vp-of-
amazon-com/
4. On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.
ODBMs Industry Watch, November 16, 2011.
5. "Big Data for Good", Roger Barga, Laura Haas, Alon Halevy, Paul Miller,
Roberto V. Zicari. ODBMS Industry Watch, June 5, 2012.
6. On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. ODBMS
Industry Watch, February 1, 2012.
7. Next generation Hadoop—interview with John Schroeder. ODBMS Industry
Watch, September 7, 2012.
8. Michael J. Carey, EDBT keynote 2012, Berlin.
9. Marc Geall, “Big Data Myth”, Deutsche Bank Report 2012.
10. On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. ODBMS
Industry Watch, December 5, 2012.
11. Hadoop and NoSQL: Interview with J. Chris Anderson. ODBMS Industry Watch,
September 19, 2012.
12. Analytics at eBay. An interview with Tom Fastner. ODBMS Industry Watch,
October 6, 2011.
13. Interview with Mike Stonebraker. ODBMS Industry Watch, May 2, 2012.

Links:
ODBMS.org www.odbms.org
ODBMS Industry Watch, www.odbms.org/blog
Section II

Semantic Technologies
and Big Data
4
Management of Big Semantic Data

Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto,


and Claudio Gutiérrez

Contents
Big Data................................................................................................................. 133
What Is Semantic Data?....................................................................................... 135
Describing Semantic Data.............................................................................. 135
Querying Semantic Data................................................................................ 136
Web of (Linked) Data........................................................................................... 137
Linked Data...................................................................................................... 138
Linked Open Data........................................................................................... 139
Stakeholders and Processes in Big Semantic Data.......................................... 140
Participants and Witnesses............................................................................ 141
Workflow of Publication-Exchange-Consumption.................................... 144
State of the Art for Publication-Exchange-Consumption.......................... 146
An Integrated Solution for Managing Big Semantic Data............................. 148
Encoding Big Semantic Data: HDT............................................................... 149
Querying HDT-Encoded Data Sets: HDT-FoQ........................................... 154
Experimental Results........................................................................................... 156
Publication Performance................................................................................ 157
Exchange Performance................................................................................... 159
Consumption Performance............................................................................ 159
Conclusions and Next Steps............................................................................... 162
Acknowledgments............................................................................................... 163
References.............................................................................................................. 164

In 2007, Jim Gray preached about the effects of the data deluge in the sciences
(Hey et al. 2009). While experimental and theoretical paradigms originally
led science, some natural phenomena were not easily addressed by analyti-
cal models. In this scenario, computational simulation arose as a new para-
digm enabling scientists to deal with these complex phenomena. Simulation
produced increasing amounts of data, particularly from the use of advanced
exploration instruments (large-scale telescopes, particle colliders, etc.) In this
scenario, scientists were no longer interacting directly with the phenomena,

but used powerful computational configurations to analyze the data gath-
ered from simulations or captured by instruments. Sky maps built from the
Sloan Digital Sky Survey observations, or the evidence found for the Higgs
boson, are just two success stories of what Gray called the fourth paradigm:
eScience.
eScience sets the basis for scientific data exploration and identifies the com-
mon problems that arise when dealing with data at large scale. It deals with
the complexities of the whole scientific data workflow, from the data cre-
ation and capture, through the organization and sharing of these data with
other scientists, to the final processing and analysis of such data. Gray linked
these problems to the way in which data are encoded “because the only way
that scientists are going to be able to understand that information is if their
software can understand the information.” In this way, data representation
emerges as one of the key factors in the process of storing, organizing, filter-
ing, analyzing, and visualizing data at large scale, but also for sharing and
exchanging them in the distributed scientific environment.
Despite its origins in science, the data deluge effects apply to many other
fields. It is easy to find real cases of massive data sources, many of which are
part of our everyday lives. Common activities, such as adding new friends
on social networks, sharing photographs, buying something electronically,
or clicking in any result returned from a search engine, are continuously
recorded in increasingly large data sets. Data is the new “raw material of
business”.
Although business is one of the major contributors to the data deluge, there
are many other players that should not go unnoticed. The Open Government
movement, around the world, is also converting public administrations into
massive data generators. In recent years, they have released large data sets
containing educational, political, economic, criminal, census information,
among many others. Besides, we are surrounded by a multitude of sensors
which continuously report information about temperature, pollution, energy
consumption, the state of the traffic, the presence or absence of a fire, etc.
Any information, anywhere and at any time, is recorded in big and constantly
evolving heterogeneous data sets that take part in the data deluge. If we add
the scientific contributions, the data sets released by traditional and digital
libraries, geographical data or collections from mass media, we can see that
the data deluge is definitely a ubiquitous revolution.
From the original eScience has evolved what has been called data science
(Loukides 2012), a discipline that copes with this ubiquity and basically refers
to the science of transforming data into knowledge. The acquisition of this
knowledge strongly depends on the existence of effective data linkage,
which enables computers to integrate data from heterogeneous data sets.
We run again into the question of how information is encoded for differ-
ent kinds of automatic processing.
Definitely, data and information standards are at the foundation of this rev-
olution, and, due to its scale, semiautomatic processing of them is essential.

An algorithmic (and standardized) data encoding is crucial to enable com-
puter exchange and understanding; for instance, this data representation
must allow computers to resolve what a gene is or what a galaxy is, or
what a temperature measurement is (Hey et al. 2009). Nowadays, the use of
graph-oriented representations and rich semantic vocabularies is gaining
momentum. On the one hand, graphs are flexible models that not only
integrate data with different degrees of structure, but also enable these het-
erogeneous data to be linked in a uniform way. On the other hand, vocab-
ularies describe what data mean. The most practical trend, in this line,
suggests the use of the Resource Description Framework: RDF (Manola
and Miller 2004), a standard model for data encoding and semantic tech-
nologies for publication, exchange, and consumption of this Big Semantic
Data at universal scale.
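As a first taste of the RDF model discussed in the following sections: data are expressed as (subject, predicate, object) triples, and basic queries are triple patterns with wildcards. The sketch below uses plain Python tuples and hypothetical prefixed names, not an actual RDF library:

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
# Prefixed names like "ex:alice" are hypothetical, standing in for full URIs.
triples = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob",   "foaf:name", '"Bob"'),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    like a variable in a SPARQL triple pattern."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

names = match(p="foaf:name")                    # everything with a name
known = match(s="ex:alice", p="foaf:knows")     # who does alice know?
```

Real systems add vocabularies, inference, and indexes over these patterns, but the triple-and-pattern core is exactly this simple, which is what makes the model so amenable to uniform linking.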
This chapter takes a guided tour to the challenges of Big Semantic Data
management and the role that it plays in the emergent Web of Data. Section
“Big Data” provides a brief overview of Big Data and its dimensions. Section
“What is Semantic Data?” summarizes the semantic web foundations and
introduces the main technologies used for describing and querying seman-
tic data. These basics set the minimal background for understanding the
notion of web of data. It is presented in section “The Web of (Linked) Data”
along with the Linked Data project and its open realization within the
Linked Open Data movement. Section “Stakeholders and Processes in Big
Semantic Data” characterizes the stakeholders and the main data flows
performed in this Web of Data (publication, exchange, and consumption),
defines them, and delves not only into their potential for data interoperability,
but also into the scalability drawbacks arising when Big Semantic Data must be
processed and queried. Innovative compression techniques are introduced
in section “An Integrated Solution for Managing Big Semantic Data,” show-
ing how the three Big Data dimensions (volume, velocity, and variety) can
be successfully addressed through an integrated solution, called HDT
(Header-Dictionary-Triples). Section “Experimental Results” comprises our
experimental results, showing that HDT allows scalability improvements to
be achieved for storage, exchange, and query answering of such emerging
data. Finally, section “Conclusions and Next Steps” concludes and outlines
the potential of HDT for its progressive adoption in Big Semantic Data
management.

Big Data
Much has been said and written these days about Big Data. News in rel-
evant magazines (Cukier 2010; Dumbill 2012b; Lohr 2012), technical reports
(Selg 2012) and white papers from leading enterprises (Dijcks 2012), some
emergent research works in newly established conferences,* popular-science
books (Dumbill 2012a), and more applied ones (Marz and Warren 2013) are
flooding us with numerous definitions, problems, and solutions related to
Big Data. It is, obviously, a trending topic in technological scenarios, but it
is also producing political, economic, and scientific impact.
We will adopt in this chapter a simple Big Data characterization. We refer
to Big Data as “the data that exceed the processing capacity of conventional
database systems” (Dumbill 2012b). Thus, any of these huge data sets gener-
ated in the data deluge may be considered Big Data. It is clear that they are
too big, they move too fast, and they do not fit, generally, the relational model
strictures (Dumbill 2012b). Under these considerations, Big Data results from
the convergence of the following three V’s:

Volume is the most obvious dimension, given the large amounts of
data continuously gathered and stored in massive data sets exposed
for different uses and purposes. Scalability is the main challenge
related to Big Data volume, considering that effective storage
mechanisms are the first requirement in this scenario. It is worth
noting that storage decisions influence data retrieval, the ultimate
goal for the user, who expects it to be performed as fast as possible,
especially in real-time systems.
Velocity describes how data flow, at high rates, in an increasingly dis-
tributed scenario. Nowadays, velocity increases at a pace similar to
volume. Streaming data processing is the main challenge related
to this dimension, because selective storage is mandatory for practi-
cal volume management, but also for real-time response.
Variety refers to the various degrees of structure (or lack thereof) within
the source data (Halfon 2012). This is mainly because Big Data may
come from multiple origins (e.g., sciences, politics, economy, social
networks, or web server logs, among others), and each origin carries
its own semantics, hence its data follow a specific structural model.
The main challenge of Big Data variety is to achieve an effective
mechanism for linking diverse classes of data differing in their inner
structure.

While volume and velocity address physical concerns, variety refers to a
logical question mainly related to the way in which data are modeled for
enabling effective integration. It is worth noting that the more data are inte-
grated, the more interesting knowledge may be generated, increasing the
resulting data set value. Under these considerations, one of the main objec-
tives in Big Data processing is to increase data value as much as possible by
directly addressing the Big Data variety. As mentioned, the use of semantic

* Big Data conferences: http://lanyrd.com/topics/big-data/



technologies seems to be ahead in this scenario, leading to the publication of
big semantic data sets.

What Is Semantic Data?


Semantic data have been traditionally related to the concept of the Semantic
Web. The Semantic Web enhances the current WWW by incorporating
machine-processable semantics into its information objects (pages, services,
data sources, etc.). Its goals are summarized as follows:

1. To give semantics to information on the WWW. The difference between
information retrieval techniques (which currently dominate WWW
information processing) and database approaches is that in the latter
data are structured via schemas that are essentially metadata.
Metadata give meaning (the semantics) to data, allowing structured
queries, that is, querying data with logical meaning and precision.
2. To make semantic data on the WWW machine-processable. Currently,
on the WWW the semantics of the data is given by humans (either
directly during manual browsing and searching, or indirectly via
information retrieval algorithms that use human feedback entered
via static links or logs of interactions). Although currently success-
ful, this process has known limitations (Quesada 2008). For Big
Data, it is crucial to automate the process of “understanding”
(giving meaning to) data on the WWW. This amounts to developing
machine-processable semantics.

To fulfill these goals, the Semantic Web community and the World Wide
Web Consortium (W3C)* have developed (i) models and languages for
representing semantics and (ii) protocols and languages for querying it. We
briefly describe them below.

Describing Semantic Data


Two families of languages sufficiently flexible, distributively extensible, and
machine-processable have been developed for describing semantic data.

1. The Resource Description Framework (RDF) (Manola and Miller 2004).
It was designed to have a simple data model, with a formal seman-
tics, with an extensible URI-based vocabulary, and which allows
* http://www.w3.org

anyone to distributedly make statements about any resource on the
Web. In this regard, an RDF description turns out to be a set of URI
triples, with the standard intended meaning. It follows the ideas of
semantic networks and graph data specifications, based on univer-
sal identifiers. It gives basic tools for linking data, plus a lightweight
machinery for coding basic meanings. It has two levels:
a. Plain RDF is the basic data model for resources and relations
between them. It is based on a basic vocabulary: a set of prop-
erties, technically binary predicates. Formally, it consists of tri-
ples of the form (s,p,o) (subject–predicate–object), where s,p,o are
URIs that use distributed vocabularies. Descriptions are state-
ments in the subject–predicate–object structure, where predicate
and object are resources or strings. Both subject and object can
be anonymous entities (blank nodes). Essentially, RDF builds
graphs labeled with meaning.
b. RDFS adds over RDF a built-in vocabulary with a normative
semantics, the RDF Schema (Brickley 2004). This vocabulary
deals with inheritance of classes and properties, as well as typ-
ing, among other features. It can be thought of as a lightweight
ontology.
2. The Web Ontology Language (OWL) (McGuinness and van Harmelen
2004). It is a version of logic languages adapted to cope with the Web
requirements, composed of basic logic operators plus a mechanism
for defining meaning in a distributed fashion.

From a metadata point of view, OWL can be considered a rich vocabulary
with high expressive power (classes, properties, relations, cardinality, equal-
ity, constraints, etc.). It comes in many flavors, but this gain in expressive
power comes at the cost of scalability (complexity of evaluation and processing).
In fact, using the semantics of OWL amounts to introducing logical reasoning
among pieces of data, thus exploding in complexity terms.

Querying Semantic Data


If one has scalability in mind, due to complexity arguments, the expressive
power of the semantics should stay at a basic level of metadata, that is, plain
RDF. This follows from the W3C design principles of interoperability, exten-
sibility, evolution, and decentralization.
As stated, RDF can be seen as a graph labeled with meaning, in which each
triple (s,p,o) is represented as a directed edge from s to o labeled with p. The
RDF data model has a corresponding query language, called SPARQL. SPARQL
(Prud’hommeaux and Seaborne 2008) is the W3C standard for querying
RDF. It is essentially a graph-pattern matching query language, composed
of three parts:

a. The pattern matching part, which includes the most basic features of
graph pattern matching, such as optional parts, union of patterns,
nesting, filtering values of possible matchings, and the possibility of
choosing the data source to be matched by a pattern.
b. The solution modifiers which, once the output of the pattern has been
computed (in the form of a table of values of variables), allow these
values to be modified by applying standard classical operators such
as projection, distinct, order, and limit.
c. Finally, the output of a SPARQL query comes in three forms:
(1) yes/no answers (ASK queries); (2) selections of values of the vari-
ables matching the patterns (SELECT queries); and (3) construction
of new RDF data from these values, and descriptions of resources
(CONSTRUCT queries).

A SPARQL query Q comprises a head and a body. The body is a complex
RDF graph pattern expression comprising triple patterns (i.e., RDF triples in
which each subject, predicate, or object may be a variable) with conjunctions,
disjunctions, optional parts, and constraints over the values of the variables.
The head is an expression that indicates how to construct the answer to Q.
The evaluation of Q against an RDF graph G is done in two steps: (i) the body
of Q is matched against G to obtain a set of bindings for the variables in the
body, and then (ii) using the information in the head, these bindings are
processed applying classical relational operators (projection, distinct, etc.) to
produce the answer to Q.
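A minimal sketch of this two-step evaluation, assuming a hypothetical graph with abbreviated `ex:` URIs and variables written `?x`: step (i) matches the conjunction of triple patterns in the body to obtain bindings, and step (ii) projects the head variables over them.

```python
def match_pattern(graph, pattern, binding):
    """Extend `binding` with every way `pattern` (a triple whose components
    may be variables, written '?x') matches a triple of `graph`."""
    for triple in graph:
        b = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):          # variable: bind, or check old binding
                if b.get(term, value) != value:
                    ok = False
                    break
                b[term] = value
            elif term != value:               # constant: must match exactly
                ok = False
                break
        if ok:
            yield b

def evaluate(graph, body, head):
    """Step (i): match the body (a conjunction of triple patterns) against the
    graph; step (ii): project the head variables from each binding."""
    bindings = [{}]
    for pattern in body:                      # conjunctive join of patterns
        bindings = [b2 for b in bindings for b2 in match_pattern(graph, pattern, b)]
    return {tuple(b[v] for v in head) for b in bindings}   # projection + distinct

# Hypothetical data: which films were shot in New York, and in which year?
graph = {
    ("ex:Film1", "ex:shotIn", "ex:NewYork"),
    ("ex:Film2", "ex:shotIn", "ex:Paris"),
    ("ex:Film1", "ex:year", "2010"),
}
body = [("?film", "ex:shotIn", "ex:NewYork"), ("?film", "ex:year", "?year")]
print(evaluate(graph, body, head=["?film", "?year"]))   # {('ex:Film1', '2010')}
```

This is only the conjunctive core of SPARQL (basic graph patterns plus projection); unions, optional parts, and filters extend the same binding machinery.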

Web of (Linked) Data


The WWW has enabled the creation of a global space comprising linked doc-
uments (Heath and Bizer 2011) that express information in a human-readable
way. All agree that the WWW has revolutionized the way we consume infor-
mation, but its document-oriented model prevents machines and automatic
agents from directly accessing the raw data underlying any web page con-
tent. The main reason is that documents are the atoms of the WWW model
and data lack an identity within them. This is not a new story: a “univer-
sal database,” in which all data can be identified at world scale, is a cherished
dream in Computer Science.
The Web of Data (Bizer et al. 2009) emerges under all previous consid-
erations in order to convert raw data into first-class citizens of the WWW.
It materializes the Semantic Web foundations and enables raw data, from
diverse fields, to be interconnected within a cloud of data-to-data hyper-
links. It achieves ubiquitous and seamless data integration at the lowest
level of granularity over the WWW infrastructure. It is worth noting that
this idea does not break with the WWW as we know it. It only enhances the
WWW with additional standards that enable data and documents to coexist
in a common space. The Web of Data grows progressively according to the
Linked Data principles.

Linked Data
The Linked Data project* originated in leveraging the practice of linking data
to the semantic level, following the ideas of Berners-Lee (2006). Its authors
state that:

Linked Data is about using the WWW to connect related data that wasn’t
previously linked, or using the WWW to lower the barriers to linking
data currently linked using other methods. More specifically, Wikipedia
defines Linked Data as “a term used to describe a recommended best
practice for exposing, sharing, and connecting pieces of data, infor-
mation, and knowledge on the Semantic Web using URIs (Uniform
Resource Identifiers) and RDF.”

The idea is to leverage the WWW infrastructure to produce, publish, and
consume data (not only documents in the form of web pages). These pro-
cesses are done by different stakeholders, with different goals, in different
forms and formats, in different places. One of the main challenges is the
meaningful interlinking of this universe of data (Hausenblas and Karnstedt
2010). It relies on the following four rules:

1. Use URIs as names for things. This rule enables each possible real-
world entity or its relationships to be unequivocally identified at
universal scale. This simple decision guarantees that any raw data
has its own identity in the global space of the Web of Data.
2. Use HTTP URIs so that people can look up those names. This decision
leverages HTTP to retrieve all data related to a given URI.
3. When someone looks up a URI, provide useful information, using stan-
dards. This standardizes processes in the Web of Data and fixes the
languages spoken by stakeholders. RDF and SPARQL, together with
the semantic technologies described in the previous section, define
the standards mainly used in the Web of Data.
4. Include links to other URIs. It materializes the aim of data integration
by simply adding new RDF triples that link data from two dif-
ferent data sets. This inter-data set linkage enables automatic
browsing.

* http://www.linkeddata.org

These four rules provide the basics for publishing and integrating Big
Semantic Data into the global space of the Web of Data. They enable raw data
to be simply encoded by combining the RDF model with URI-based identi-
fication, both for entities and for their relationships, adequately labeled using
rich semantic vocabularies. Berners-Lee (2002) expresses the Linked Data
relevance as follows:
Linked Data allows different things in different data sets of all kinds to be con-
nected. The added value of putting data on the WWW is given by the way it
can be queried in combination with other data you might not even be aware
of. People will be connecting scientific data, community data, social web
data, enterprise data, and government data from other agencies and organi-
zations, and other countries, to ask questions not asked before.
Linked data is decentralized. Each agency can source its own data without a big
cumbersome centralized system. The data can be stitched together at the edges,
more as one builds a quilt than the way one builds a nuclear power station.
A virtuous circle. There are many organizations and companies which will
be motivated by the presence of the data to provide all kinds of human access
to this data, for specific communities, to answer specific questions, often in
connection with data from different sites.
The project and further information about linked data can be found in
Bizer et al. (2009) and Heath and Bizer (2011).
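A small sketch of rules 2 and 4 in practice, assuming hypothetical example.org URIs (only the owl:sameAs URI is standard): a consumer prepares a lookup of an HTTP URI, negotiating an RDF representation, while a publisher links its entity to another data set by adding a single triple. The HTTP request is only constructed here, never sent.

```python
import urllib.request

# Rule 2: entities are named with HTTP URIs that can be looked up.
entity = "http://example.org/resource/NewYork"   # hypothetical URI

# Rule 3: a consumer asks for a standard (RDF) representation of the entity
# via HTTP content negotiation. The request is built but not sent.
req = urllib.request.Request(entity, headers={"Accept": "text/turtle"})
print(req.get_header("Accept"))                # text/turtle

# Rule 4: linking two data sets is just adding a new triple; owl:sameAs
# states that both URIs identify the same real-world entity.
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"
link = (entity, OWL_SAMEAS, "http://dbpedia.org/resource/New_York_City")
dataset = {link}
```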

Linked Open Data


Although Linked Data does not prevent its application in closed environ-
ments (private institutional networks or any class of intranet), the most vis-
ible example of adoption and application of its principles runs openly. The
Linked Open Data (LOD) movement promotes the release of semantic data
under open licenses that do not impede free data reuse. Tim Berners-Lee also
devised a “five-star” test to measure how well Open Data implements the
Linked Data principles:

1. Make your stuff available on the web (whatever format).
2. Make it available as structured data (e.g., excel instead of image scan
of a table).
3. Use nonproprietary format (e.g., CSV instead of excel).
4. Use URLs to identify things, so that people can point at your stuff.
5. Link your data to other people’s data to provide context.

The LOD cloud has grown significantly since its origins in May 2007.* The first
report pointed out that 12 data sets were part of this cloud; 45 were acknowledged
in September 2008, 95 data sets in 2009, 203 in 2010, and 295 different data sets

* http://richard.cyganiak.de/2007/10/lod/

in the last estimation (September 2011). These last statistics* point out that more
than 31 billion triples are currently published and more than 500 million links
establish cross-relations between data sets. Government data are predominant
in LOD, but other fields such as geography, life sciences, media, or publications
are also strongly represented. It is worth emphasizing the existence of many
cross-domain data sets comprising data from several diverse fields. These tend
to be hubs, because they provide data that may be linked from and to the vast
majority of specific data sets. DBpedia† is considered the nucleus of the LOD
cloud (Auer et al. 2007). In short, DBpedia gathers the raw data underlying the
Wikipedia web pages and exposes the resulting representation following the
Linked Data rules. It is an interesting example of Big Semantic Data, and its
management is considered within our experiments.

Stakeholders and Processes in Big Semantic Data


Although we identify data scientists as one of the main actors in the man-
agement of Big Semantic Data, we also unveil potential “traditional” users
when moving from a Web of documents to a Web of data, or, in this context,
to a Web of Big Semantic Data. The scalability problems arising for data
experts and general users cannot be the same, as these are supposed to man-
age the information under different perspectives. A data scientist can make
strong efforts to create novel semantic data or to analyze huge volumes of
data created by third parties. She can make use of data-intensive computing,
distributed machines, and algorithms; spending several hours computing the
closure of a graph is perfectly acceptable. In contrast, a common user retriev-
ing, for instance, all the movies shot in New York in a given year expects,
if not an immediate answer, at least a reasonable response time. Although one
could establish a strong frontier between data (and their problems) of these
worlds, we cannot forget that establishing and discovering links between
diverse data is beneficial for all parties. For instance, in life sciences it is
important to have links between the bibliographic data of publications and
the concrete genes studied in each publication, so that another researcher can
look up previous findings on the genes they are currently studying.
The concern here is to address specific management problems while
remaining in a general open representation and publication infrastructure
in order to leverage the full potential of Big Semantic Data. Under this prem-
ise, a first characterization of the involved roles and processes would allow
researchers and practitioners to clearly focus their efforts on a particular
area. This section provides an approach toward this characterization. We

* http://www4.wiwiss.fu-berlin.de/lodcloud/state/
† http://dbpedia.org

first establish a simple set of stakeholders in Big Semantic Data, from where
we define a common data workflow in order to better understand the main
processes performed in the Web of Data.

Participants and Witnesses


One of the main breakthroughs after the creation of the Web was the con-
sideration of the common citizen as the main stakeholder, that is, a party
involved not only in the consumption, but also in the creation of content. To
emphasize this fact, the notion of Web 2.0 was coined, and its implications
such as blogging, tagging, or social networking became one of the roots of
our current sociability.
The Web of Data can be considered as a complementary dimension to this
successful idea, which addresses the data set problems of the Web. It focuses
on representing knowledge through machine-readable descriptions (i.e.,
RDF), using specific languages and rules for knowledge extraction and rea-
soning. How this could be achieved by the general audience, and exploited
for the general market, will determine its chances of success beyond the sci-
entific community.
To date, neither the creation of self-described semantic content nor the
linkage to other sources is a simple task for a common user. There exist sev-
eral initiatives to bring semantic data creation to a wider audience, being the
most feasible use of RDFa (Adida et al. 2012). Vocabulary and link discovery
can also be mitigated through searching and recommendation tools (Volz
et al. 2009; Hogan et al. 2011). However, in general terms, one could argue
that the creation of semantic data is still almost as narrow as the original
content creation in Web 1.0. In the LOD statistics, previously reported, only
0.42% of the total data is user generated. It means that public organizations
(governments, universities, digital libraries, etc.), researchers, and innovative
enterprises are the main creators, whereas citizens are, at this point, just wit-
nesses of a hidden increasingly reality.
This reality shows that these few creators are able to produce huge vol-
umes of RDF data, yet we will argue, in the next section, about the quality
of these publication schemes (in agreement with empirical surveys; Hogan
et  al. 2012). In what follows, we characterize a minimum set of stakehold-
ers interacting with this huge graph of knowledge with such an enormous
potential. Figure 4.1 illustrates the main identified stakeholders within Big
Semantic Data. Three main roles are present: creators, publishers, and consum-
ers, with an internal subdivision by creation method or intended use. In par-
allel, we distinguish between automatic stakeholders, supervised processes, and
human stakeholders. We define below each stakeholder, assuming that (i) this
classification may not be complete as it is intended to cover the minimum
foundations to understand the managing processes in Big Semantic Data
and (ii) categories are not disjoint; an actor could participate with several
roles in a real-world scenario.

[Figure: a matrix relating the three roles to the nature of the stakeholder
(automatic stakeholders, supervised processes, human stakeholders). The
creator works from scratch, by conversion from another data format, or by
data integration from existing content; the publisher is Linked Data
compliant; the consumer performs direct consumption, intensive consumer
processing, or composition of data.]

Figure 4.1
Stakeholder classification in Big Semantic Data management.

Creator: one that generates a new RDF data set by, at least, one of these
processes:

• Creation from scratch: the novel data set is not based on a previous
model. Even if the data exist beforehand, the data modeling process
is not biased by the previous data format. RDF authoring tools* are
traditionally used.
• Conversion from other data format: the creation phase is highly deter-
mined by the conversion of the original data source; potential map-
pings between source and target data could be used, for example, from
relational databases (Arenas et al. 2012), as well as (semi-)automatic
conversion tools.†
• Data integration from existing content: the focus moves to an efficient
integration of vocabularies and the validation of shared entities
(Knoblock et al. 2012).

Several tasks are shared among all three processes. Some examples of these
commonalities are the identification of the entities to be modeled (but this

* A list of RDF authoring tools can be found at http://www.w3.org/wiki/AuthoringToolsForRDF


† A list of RDF converters can be found at http://www.w3.org/wiki/ConverterToRdf

task is more important in the creation from scratch, as no prior identification
has been done) or the vocabulary reuse (crucial in data integration, in which
different ontologies could be aligned). A complete description of the creation
process is out of the scope of this work (the reader can find a guide for Linked
Data creation in Heath and Bizer 2011).
Publisher: one that makes RDF data publicly available for different pur-
poses and users. From now on, let us suppose that the publisher follows the
Linked Data principles. We distinguish creators from publishers as, in many
cases, the roles can strongly differ. Publishers do not have to create RDF con-
tent, but they are responsible for the published information, the availability of
the offered services (such as querying), and the correct adherence to the Linked
Data principles. For instance, a creator could be a set of sensors giving the
temperature in a given area in RDF (Atemezing et al. 2013), while the pub-
lisher is the entity who publishes this information and provides entry points
to it.
Consumer: one that makes use of published RDF data:

• Direct consumption: a process whose computation task mainly
involves the publisher, without intensive processing at the consumer.
Downloads of the total data set (or subparts), online querying, infor-
mation retrieval, visualization, or summarization are simple exam-
ples in which the computation is focused on the publisher.
• Intensive consumer processing: processes with a nonnegligible con-
sumer computation, such as offline analysis, data mining, or reason-
ing over the full data set or a subpart (live views; Tummarello et al.
2010).
• Composition of data: those processes integrating different data sources
or services, such as federated services over the Web of Data (Schwarte
et al. 2011; Taheriyan et al. 2012) and RDF snippets in search engines
(Haas et al. 2011).

As stated, we make an orthogonal classification of the stakeholders attending
to the nature of creators, publishers, and consumers. For instance, a sensor
could directly create RDF data, but it could also consume RDF data.
Automatic stakeholders, such as sensors, Web processes (crawlers, search
engines, recommender systems), RFID labels, smart phones, etc. Automatic
RDF streaming, for instance, is set to become a hot topic, especially within the
development of smart cities (De et al. 2012). Note that, although each piece of
information could be particularly small, the whole system can be seen also
as a big semantic data set.
Supervised processes, that is, processes with human supervision, such as semantic
tagging and folksonomies within social networks (García-Silva et al. 2012).
Human stakeholders, who perform most of the tasks for creating, publishing,
or consuming RDF data.

The following running example provides a practical review of this
classification. Nowadays, an RFID tag could document a user context through
RDF metadata descriptions (Foulonneau 2011). We devise a system in which
RFID tags provide data about temperature and position. Thus, we have thou-
sands of sensors providing RDF excerpts modeling the temperature in dis-
tinct parts of a city. Users can visualize and query online this information,
establishing some relationships, for example, with special events (such as a
live concert or sport matches). In addition, the RDF can be consumed by a
monitoring system, for example, to alert the population in case of extreme
temperatures.
Following the classification, each sensor is an automatic creator, and together
they conform a potentially huge volume of RDF data. While a sensor should
be designed to take care of the RDF description (e.g., to follow a set of vocab-
ularies and description rules and to minimize the size of descriptions), it
cannot address publishing facilities (query endpoints, services to users, etc.).
Alternatively, intermediate hubs would collect the data, and the authorita-
tive organization would be responsible for its publication and for the applica-
tions and services over these data. This publication authority would be
considered a supervised process, solving the scalability issues of huge RDF
data streams by collecting the information, filtering it (e.g., eliminating
redundancy), and finally complying with Linked Data standards. Although these processes
could be automatic, let us suppose that human intervention is needed to
define links between data, for instance, linking positions to information
about city events. Note also that intermediate hubs could be seen as super-
vised consumers of the sensors, yet the information coming from the sensors
is not openly published but streamed to the appropriate hub. Finally, the
consumers are humans, in the case of the online users (concerned with query
resolution, visualization, summarization, etc.), or an automatic (or semiauto-
matic) process, in the case of monitoring (performing potentially complex
inference or reasoning).

Workflow of Publication-Exchange-Consumption
The previous RFID network example shows the enormous diversity of pro-
cesses and different concerns for each type of stakeholder. In what follows,
we will consider the creation step out of the scope of this work, because our
approach relies on the existence of big RDF data sets (without belittling those
that may be created hereinafter). We focus on tasks involving large-scale
management: for instance, the scalability issues of visually authoring a big RDF
data set are comparable to those of RDF visualization by consumers, and the
performance of RDF data integration from existing content depends on efficient
access to the data, and thus on existing indexes, a crucial issue also for query
response.
Management processes for publishers and consumers are diverse and
complex to generalize. However, it is worth characterizing a common work-
flow present in almost every application in the Web of Data in order to place

[Figure: a cycle between publishers and consumers comprising (1) Publication
(dereferenceable URIs, RDF dumps, SPARQL endpoints/APIs), (2) Exchange,
and (3) Consumption (reasoning/integration, quality/provenance, indexing).]

Figure 4.2
Publication-Exchange-Consumption workflow in the Web of Data.

scalability issues in context. Figure 4.2 illustrates the identified workflow of
Publication-Exchange-Consumption.
Publication refers to the process of making RDF data publicly available for
diverse purposes and users, following the Linked Data principles. Strictly,
the only obligatory “service” in these principles is to provide dereference-
able URIs, that is, URIs that return related information about an entity. In
practice, publishers complete this basic functionality by exposing their data
through public APIs, mainly via SPARQL endpoints, a service that interprets
the SPARQL query language. They also provide RDF dumps, files to fully or
partly download
the RDF data set.
Exchange is the process of information exchange between publishers and consumers. Although the information is represented in RDF, note that consumers may obtain different "views," and hence formats, some of them not necessarily in RDF. For instance, the result of a SPARQL query could be provided as a CSV file, or the consumer might request a summary with statistics of the data set as an XML file. As we are addressing the management of semantic data sets, we restrict exchange to RDF interchange. Thus, we rephrase exchange as the process of RDF exchange between publishers and consumers after an RDF dump request, a SPARQL query resolution, or any other request or service provided by the publisher.
Consumption can involve, as stated, a wide range of processes, from direct
consumption to intensive processing and composition of data sources. Let us
simply define consumption as the use of potentially large RDF data for
diverse purposes.
One final remark must be made. The workflow definition seems to restrict management to large RDF data sets. However, we would like to extend scalability issues to a wider range of publishers and consumers with more limited resources. For instance, similar scalability problems arise when managing RDF on mobile devices; although the amount of information may be smaller, these devices have more restrictive requirements for transmission costs/latency, and for postprocessing, due to their inherent memory and CPU constraints (Le-Phuoc et al. 2010). In the following, whenever we provide approaches for managing these processes in large RDF data sets, we ask the reader to bear this consideration in mind.
146 Big Data Computing

State of the Art for Publication-Exchange-Consumption


This section summarizes some of the current trends to address publication,
exchange, and consumption at large scale.
Publication schemes: straightforward publication following the Linked Data principles presents several problems for large data sets (Fernández et al. 2010). A previous analysis of published RDF data sets reveals several undesirable features: provenance and metadata about contents are barely present and, when given, this information is neither complete nor systematic. Furthermore, the RDF dump files have neither internal structure nor a summary of their content. A massive empirical study of Linked Open Data sets in Hogan et al. (2012) draws similar conclusions: few providers attach human-readable metadata or licensing information to their resources. The same observations apply to SPARQL endpoints, where a consumer knows almost nothing beforehand about the content she is going to query. In general terms, except for the general Linked Data recommendations (Heath and Bizer 2011), few works address the publication of RDF at large scale.
The Vocabulary of Interlinked Datasets: VoID (Alexander et al. 2009) is the nearest approximation to the discovery problem, providing a bridge
between publishers and consumers. Publishers make use of a specific vocab-
ulary to add metadata to their data sets, for example, to point to the asso-
ciated SPARQL endpoint and RDF dump, to describe the total number of
triples, and to connect to linked data sets. Thus, consumers can look up this
metadata to discover data sets or to reduce the set of interesting data sets in
federated queries over the Web of Data (Akar et al. 2012). Semantic Sitemaps
(Cyganiak et  al. 2008) extend the traditional Sitemap Protocol for describ-
ing RDF data. They include new XML tags so that crawling tools (such as
Sindice*) can discover and consume the data sets.
As a last remark, note that URI dereferencing can be implemented in a straightforward way by publishing one document per URI or set of URIs. However, the publisher commonly materializes the output by querying the data set at URI resolution time. This moves the problem to the underlying RDF store, which also has to deal with scalability problems (see "Efficient RDF Consumption" below). The empirical study in Hogan et al. (2012) also confirmed that publishers often do not provide locally known inlinks in the dereferenced response, which must be taken into account by consumers.
RDF Serialization Formats: as we previously stated, we focus on exchanging
large-scale RDF data (or smaller volumes in limited resources stakeholders).
Under this consideration, the RDF serialization format directly determines
the transmission costs and latency for consumption. Unfortunately, data
sets are currently serialized in plain and verbose formats such as RDF/XML
(Beckett 2004) or Notation3: N3 (Berners-Lee 1998), a more compact and read-
able alternative. Turtle (Beckett and Berners-Lee 2008) inherits N3 compact

* http://sindice.com/

ability adding interesting extra features, for example, abbreviated RDF data
sets. RDF/JSON (Alexander 2008) has the advantage of being coded in a language that is easier to parse and more widely accepted in the programming world.
Although all these formats present features to “abbreviate” constructions,
they are still dominated by a document-centric and human-readable view
which adds an unnecessary overhead to the final data set representation.
In order to reduce exchange costs and delays on the network, universal
compressors (e.g., gzip) are commonly used over these plain formats. In
addition, specific interchange oriented representations may also be used. For
instance, the Efficient XML Interchange Format: EXI (Schneider and Kamiya
2011) may be used for representing any valid RDF/XML data set.
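The effect of a universal compressor over such verbose serializations can be sketched with Python's standard gzip module (the N-Triples fragment is invented; real data sets compress well because long URI prefixes repeat massively):

```python
import gzip

# An invented N-Triples fragment: verbose URIs repeated across many triples.
ntriples = "".join(
    f"<http://example.org/resource/{i}> "
    f"<http://example.org/ontology/label> "
    f'"entity {i}" .\n'
    for i in range(1000)
).encode("utf-8")

compressed = gzip.compress(ntriples)
print(len(compressed) / len(ntriples))  # compression ratio well below 1
```

Note that the consumer then pays the symmetric cost: the dump must be decompressed (`gzip.decompress`) before any parsing can start.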
Efficient RDF Consumption: the aforementioned variety of consumer tasks makes it difficult to achieve a one-size-fits-all technique. However, some general concerns can be outlined. In most scenarios, performance is influenced by (i) the serialization format, due to the overall data exchange time, and (ii) the RDF indexing/querying structure. In the first case, if compressed RDF has been exchanged, it must first be decompressed. In this sense, the serialization format affects consumption through the transmission cost, but also through the ease of parsing. The latter factor affects the consumption process in different ways:

• For SPARQL endpoints and dereferenceable URI materialization, the response time depends on the efficiency of the underlying RDF indexes at the publisher.
• Once the consumer has the data set, the most likely scenario is indexing it in order to operate with the RDF graph, for example, for intensive operations of inference, integration, etc.

Although the indexing at consumption could be performed once, the amount of resources required for it may be prohibitive for many potential consumers (especially for mobile devices with a limited computational configuration). In both cases, for publishers and consumers, an RDF store indexing the data sets is the main actor for efficient consumption. Diverse techniques provide efficient RDF indexing, but scalable indexing and query optimization remain open challenges (Sidirourgos et al. 2008; Schmidt et al. 2010). On the one hand, some RDF stores are built over relational databases and perform SPARQL queries through SQL, for example, Virtuoso.* The most successful relational-based approach performs a vertical
partitioning, grouping triples by predicates, and storing them in indepen-
dent 2-column tables (S,O) (Sidirourgos et al. 2008; Abadi et al. 2009). On the
other hand, some stores, such as Hexastore (Weiss et al. 2008) or RDF-3X (Neumann and Weikum 2010), build indices for all possible combinations of elements in RDF (SPO, SOP, PSO, POS, OPS, OSP), allowing (i) all triple patterns to
* http://www.openlinksw.com/dataspace/dav/wiki/Main/VOSRDF

be directly resolved in the corresponding index and (ii) the first join step
to be resolved through fast merge-joins. Although this achieves a globally competitive performance, the index replication largely increases spatial requirements. Other solutions take advantage of structural properties of the data
model (Tran et  al. 2012), introduce specific graph compression techniques
(Atre et al. 2010; Álvarez-García et al. 2011), or use distributed nodes within a
MapReduce infrastructure (Urbani et al. 2010).

An Integrated Solution for Managing Big Semantic Data


When dealing with Big Semantic Data, each step in the workflow must be
designed to address the three Big Data dimensions. While variety is man-
aged through semantic technologies, this decision determines the way vol-
ume and velocity are addressed. As previously discussed, data serialization
has a big impact on the workflow, as traditional RDF serialization formats
are designed to be human readable instead of machine processable. They
may fit smaller scenarios in which volume or velocity are not an issue, but
under the presented premises, it clearly becomes a bottleneck of the whole
process. We present, in the following, the main requirements for an RDF
serialization format of Big Semantic Data.

• It must be generated efficiently from another RDF input format. For instance, a data creator having the data set in a semantic database must be able to dump it efficiently into an optimized exchange format.
• It must be space efficient. The generated dump should be as small as
possible, introducing compression for space savings. Bear in mind
that big semantic data sets are shared on the Web of Data and they
may be transferred through the network infrastructure to hundreds
or even thousands of clients. Reducing size will not only minimize
the bandwidth costs of the server, but also the waiting time of con-
sumers who are retrieving the data set for any class of consumption.
• It must be ready for postprocessing. A typical case is performing a sequential triple-to-triple scan for some postprocessing task. This can seem trivial, but it is clearly time-consuming when Big Semantic Data are postprocessed by the consumer. As shown in our experiments, just parsing a data set of 640 million triples, serialized in NTriples and gzip-compressed, takes more than 40 min on a modern computational configuration.
• It must be easy to convert to other representations. The most usual scenario at consumption involves loading the data set into an RDF store. Most of the solutions reviewed in the previous section use disk-resident variants of B-Trees, which keep a subset of the pages in main memory. If data are already sorted, this process is more efficient than doing it on unsorted elements. Therefore, having the data presorted can be a step ahead in these cases. Also, many stores keep several indices for the different triple orderings (SPO, OPS, PSO, etc.). If the serialization format enables data traversal in different orders, the multi-index generation process can be completed more efficiently.
• It should be able to locate pieces of data within the whole data set. It is desirable to avoid a full scan over the data set just to locate a particular piece of data, since such a scan is a highly time-consuming process over Big Semantic Data. Thus, the serialization format must retain all possible clues enabling direct access to any piece of data in the data set. As explained for the SPARQL query language, a basic way of specifying which triples to fetch is a triple pattern, where each component is either a constant or a variable. A desirable format should be ready to solve most combinations of triple patterns (combinations of constants and variables in subject, predicate, and object). For instance, a typical triple pattern provides the subject, leaving the predicate and object as variables (and therefore as the expected result); in such cases, we intend to locate all the triples that talk about a specific subject. In other words, this requirement can be summarized succinctly: data must be encoded in such a way that “the data are the index.”

Encoding Big Semantic Data: HDT


Our approach, HDT: Header–Dictionary–Triples (Fernández et al. 2010), consid-
ers all of the previous requirements, addressing a machine-processable RDF
serialization format which enables Big Semantic Data to be efficiently man-
aged within the common workflows of the Web of Data. The format formalizes
a compact binary serialization optimized for storage or transmission over a
network. It is worth noting that HDT has been described and proposed for standardization as a W3C Member Submission (Fernández et al. 2011). In addition, a suc-
cinct data structure has been proposed (Martínez-Prieto et al. 2012a) to browse
HDT-encoded data sets. This structure holds the compactness of such repre-
sentation and provides direct access to any piece of data as described below.
HDT organizes Big Semantic Data in three logical components (Header,
Dictionary, and Triples) carefully described to address not only RDF peculiari-
ties, but also considering how these data are actually used in the Publication-
Exchange-Consumption workflow.
Header. The Header holds, in plain RDF format, metadata describing a big
semantic data set encoded in HDT. It acts as an entry point for a consumer,
who can peek at certain key properties of the data set to have an idea of

its content, even before retrieving the whole data set. It enhances the VoID
Vocabulary (Alexander et al. 2009) to provide a standardized binary data set
description in which some additional HDT-specific properties are appended.*
The Header component comprises four distinct sections:

• Publication Metadata provides information about the publication act, for instance, when the data set was generated, when it was made public, who the publisher is, where the associated SPARQL endpoint is, etc. Many properties of this type are described using the popular Dublin Core Vocabulary.†
• Statistical Metadata provides statistical information about the data set, such as the number of triples; the number of different subjects, predicates, and objects; or even histograms. For instance, this class of metadata is very valuable for visualization software or federated query evaluation engines.
• Format Metadata describes how Dictionary and Triples components
are encoded. This allows one to have different implementations or
representations of the same data in different ways. For instance, one
could prefer to have the triples in SPO order, whereas other applica-
tions might need it in OPS. Also the dictionary could apply a very
aggressive compression technique to minimize the size as much as
possible, whereas another implementation could be focused on query
speed and even include a full-text index to accelerate text searches.
These metadata enable the consumer to check how an HDT-encoded data set can be accessed in the data structure.
• Additional Metadata. Since the Header contains plain RDF, the pub-
lisher can enhance it using any vocabulary. It allows specific data
set/application metadata to be described. For instance, in life sci-
ences a publisher might want to describe, in the Header, that the
data set describes a specific class of proteins.

Since RDF enables data integration at any level, the Header component
ensures that HDT-encoded data sets are not isolated and can be intercon-
nected. For instance, it is a great tool for query syndication. A syndicated
query engine could maintain a catalog composed of the Headers of different
HDT-encoded data sets from many publishers and use it to know where to
find more data about a specific subject. Then, at query time, the syndicated
query engine can either use the remote SPARQL endpoint to query directly
the third-party server or even download the whole data set and save it in a
local cache. Thanks to the compact size of HDT-encoded data sets, both the
transmission and storage costs are highly reduced.

* http://www.w3.org/Submission/2011/SUBM-HDT-Extending-VoID-20110330/
† http://dublincore.org/
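Such a catalog can be sketched as a simple in-memory lookup (the data set names, endpoints, and topic fields below are invented; a real Header would carry VoID/Dublin Core properties rather than plain strings):

```python
# Sketch of a syndication catalog built from Header metadata.
# Data set names, endpoints, and topic fields are invented for the example.

catalog = {
    "http://example.org/proteins.hdt": {
        "endpoint": "http://example.org/sparql",
        "triples": 120_000_000,
        "topics": {"protein", "gene"},
    },
    "http://example.org/geo.hdt": {
        "endpoint": "http://geo.example.org/sparql",
        "triples": 8_000_000,
        "topics": {"country", "city"},
    },
}

def datasets_about(topic):
    """Return the data sets whose Header declares the given topic."""
    return [uri for uri, meta in catalog.items() if topic in meta["topics"]]

print(datasets_about("protein"))  # the engine now knows where to send the query
```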

Dictionary. The Dictionary is a catalog comprising all the different terms


used in the data set, such as URIs, literals, and blank nodes. A unique identi-
fier (ID) is assigned to each term, enabling triples to be represented as tuples
of three IDs which, respectively, reference the corresponding terms in the
dictionary. This is the first step toward compression, since it avoids representing long terms repeatedly. This way, each term occurrence is replaced by its corresponding ID, whose encoding requires fewer bits in the vast majority of cases. Furthermore, the catalog of terms within the dic-
tionary may be encoded in many advanced ways focused on boosting que-
rying or reducing size. A typical example is to use any kind of differential
compression for encoding terms sharing long prefixes, for example, URIs.
The dictionary is divided into sections depending on whether the term
plays subject, predicate, or object roles. Nevertheless, in semantic data, it
is quite common that a URI appears both as a subject in one triple and as
an object in another. To avoid repeating those terms twice, in the subjects and
in the objects sections, we can extract them into a fourth section called shared
Subject-Object.
Figure 4.3 depicts the 4-section dictionary organization and how IDs are assigned to the corresponding terms. Each section is sorted lexicographically, and then correlative IDs are assigned to its terms, from 1 to n. It is worth noting that, for subjects and objects, the shared Subject-Object section uses the lower range of IDs; for example, if there are m terms playing interchangeably as subject and object, all IDs x such that x ≤ m belong to this shared section.
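This sectioning and ID assignment can be sketched as follows (the tiny term set is invented for the example):

```python
# Sketch of the 4-section HDT dictionary: terms are split into shared
# subject-object, subject-only, object-only, and predicate sections, each
# sorted lexicographically; shared terms take the lowest IDs in both the
# subject and the object ID spaces. Term names are invented.

def build_dictionary(triples):
    subjects = {s for s, p, o in triples}
    predicates = sorted({p for s, p, o in triples})
    objects = {o for s, p, o in triples}
    shared = sorted(subjects & objects)
    subj_only = sorted(subjects - objects)
    obj_only = sorted(objects - subjects)
    # Shared terms get IDs 1..m in both the subject and the object ID spaces.
    subject_ids = {t: i for i, t in enumerate(shared + subj_only, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + obj_only, start=1)}
    predicate_ids = {t: i for i, t in enumerate(predicates, start=1)}
    return subject_ids, predicate_ids, object_ids

triples = [("ex:alice", "ex:knows", "ex:bob"),
           ("ex:bob", "ex:name", '"Bob"')]
s_ids, p_ids, o_ids = build_dictionary(triples)
print(s_ids["ex:bob"], o_ids["ex:bob"])  # "ex:bob" gets the same low ID in both roles
```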
HDT allows one to use different techniques of dictionary representation. Each one can handle its catalog of terms in different ways, but must always implement these basic operations:

• Locate (term): finds the term and returns its ID
• Extract (id): extracts the term associated to the ID
• NumElements (): returns the number of elements of the section

More advanced techniques might also provide these optional operations:

• Prefix (p): finds all terms starting with the prefix p
• Suffix (s): finds all terms ending with the suffix s

[Figure 4.3: HDT dictionary organization into four sections. Shared Subject-Object terms take IDs 1 to |sh| in both roles; subject-only terms take IDs |sh|+1 to |S|; object-only terms take IDs |sh|+1 to |O|; predicates take IDs 1 to |P|.]

• Substring (s): finds all the terms containing the substring s
• Regex (e): finds all strings matching the specified regular expression e

For instance, these advanced operations are very convenient when serving query suggestions to the user, or when evaluating SPARQL queries that include REGEX filters.
We suggest a Front-Coding (Witten et al. 1999) based representation as the simplest way of dictionary encoding. It has been successfully used in
many WWW-based applications involving URL management. It is a very sim-
ple yet effective technique based on differential compression. This technique
applies to lexicographically sorted dictionaries by dividing them into buckets
of b terms. By tweaking this bucket size, different space/time trade-offs can
be achieved. The first term in the bucket is explicitly stored and the remain-
ing b − 1 ones are encoded with respect to their precedent: the common prefix
length is first encoded and the remaining suffix is appended. More technical
details about these dictionaries are available in Brisaboa et al. (2011).
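A minimal sketch of Front-Coding over a sorted term list (the bucket size, the invented terms, and the linear decoding loop are simplifications of a real implementation):

```python
# Sketch of Front-Coding over a lexicographically sorted term list.
# Each bucket stores its first term explicitly; every following term is
# stored as (shared-prefix length, remaining suffix).

import os

def front_code(sorted_terms, bucket_size):
    buckets = []
    for i in range(0, len(sorted_terms), bucket_size):
        chunk = sorted_terms[i:i + bucket_size]
        head, encoded, prev = chunk[0], [], chunk[0]
        for term in chunk[1:]:
            common = len(os.path.commonprefix([prev, term]))
            encoded.append((common, term[common:]))
            prev = term
        buckets.append((head, encoded))
    return buckets

def extract(buckets, bucket_size, idx):
    """Decode the idx-th term (0-based) by rebuilding from its bucket head."""
    head, encoded = buckets[idx // bucket_size]
    term = head
    for common, suffix in encoded[:idx % bucket_size]:
        term = term[:common] + suffix
    return term

terms = sorted(["http://example.org/A", "http://example.org/AB",
                "http://example.org/B", "http://other.org/C"])
buckets = front_code(terms, bucket_size=2)
print(extract(buckets, 2, 1))  # decodes back the second term
```

Larger buckets compress better but make Extract slower, which is exactly the space/time trade-off mentioned above.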
The work of Martínez-Prieto et al. (2012b) surveys the problem of encoding
compact RDF dictionaries. It reports that Front-Coding achieves a good perfor-
mance for a general scenario, but more advanced techniques can achieve bet-
ter compression ratios and/or handle complex operations directly. In any case,
HDT is flexible enough to support any of these techniques, allowing stake-
holders to decide which configuration is better for their specific purposes.
Triples. As stated, the Dictionary component allows spatial savings to be achieved, but it also enables RDF triples to be compactly encoded as tuples of three IDs referring to the corresponding terms in the Dictionary. Thus, our original RDF graph is now transformed into a graph of IDs whose encoding can be carried out in a more optimized way.
We devise a Triples encoding that organizes internally the information
in a way that exploits graph redundancy to keep data compact. Moreover,
this encoding can be easily mapped into a data structure that allows basic
retrieval operations to be performed efficiently.
Triple patterns are the SPARQL query atoms for basic RDF retrieval. That
is, all triples matching a template (s, p, o) (where s, p, and o may be variables)
must be directly retrieved from the Triples encoding. For instance, in the
geographic data set Geonames,* the triple pattern below searches all the sub-
jects whose feature code (the predicate) is “P” (the object), a shortcode for
“country.” In other words, it asks about all the URIs representing countries:

?subject <http://www.geonames.org/ontology#featureCode>
   <http://www.geonames.org/ontology#P>

Thus, the Triples component must be able to retrieve the subject of all those
triples matching this pair of predicate and object.

* http://www.geonames.org

HDT proposes a Triples encoding named BitmapTriples (BT). This tech-


nique needs the triples to be previously sorted in a specific order, such as
subject–predicate–object (SPO). BT is able to handle all possible triple order-
ings, but we only describe the intuitive SPO order for explanation purposes.
Basically, BT transforms the graph into a forest containing as many trees as
different subjects are used in the data set, and these trees are then ordered by
subject ID. This way, the first tree represents all triples rooted by the subject
identified as 1, the second tree represents all triples rooted by the subject
identified as 2, and so on. Each tree comprises three levels: the root repre-
sents the subject, the second level lists all predicates related to the subject,
and finally the leaves organize all objects for each pair (subject, predicate).
Predicate and object levels are also sorted.

• All predicates related to a subject are sorted in increasing order. As Figure 4.4 shows, predicates are sorted as {5, 6, 7} for the second subject.
• Objects follow an increasing order within each path in the tree. That is, objects are internally ordered for each pair (subject, predicate). As Figure 4.4 shows, the object 4 is listed first (because it is related to the pair (2, 5)), then 1 and 3 (because these are related to the pair (2, 6)), and another 4 is the last object because of its relation to (2, 7).

Each triple in the data set is now represented as a full root-to-leaf path in the corresponding tree. This simple reorganization reveals many interesting features.

• The subject can be implicitly encoded given that the trees are sorted
by subject and we know the total number of trees. Thus, BT does not
perform a triples encoding, but it represents pairs (predicate, object).
This is an obvious spatial saving.
• Predicates are sorted within each tree. This is very similar to a well-
known problem: posting list encoding for Information Retrieval
(Witten et al. 1999; Baeza-Yates and Ribeiro-Neto 2011). This allows
applying many existing and optimized techniques to our problem.

[Figure 4.4: Description of Bitmap Triples. ID-triples (SPO): (1, 7, 2), (1, 8, 4), (2, 5, 4), (2, 6, 1), (2, 6, 3), (2, 7, 4). Encoded sequences: Sp = 7 8 5 6 7; Bp = 1 0 1 0 0; So = 2 4 4 1 3 4; Bo = 1 1 1 1 0 1.]

Besides, efficient search within predicate lists is enabled by assuming that the elements follow a known ordering.
• Objects are sorted within each path in the tree, so (i) these can be
effectively encoded and (ii) these can also be efficiently searched.

BT encodes the Triples component level by level. That is, predicate and
object levels are encoded in isolation. Two structures are used for predicates:
(i) an ID sequence (Sp) concatenates predicate lists following the tree order-
ing; (ii) a bitsequence (Bp) uses one bit per element in Sp: 1 bits mean that
this predicate is the first one for a given tree, whereas 0 bits are used for
the remaining predicates. Object encoding is performed in a similar way:
So concatenates object lists, and Bo tags each position in such a way that 1 bits
represent the first object in a path, and 0 bits the remaining ones. The right
part of Figure 4.4 illustrates all these sequences for the given example.
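This level-by-level construction can be sketched as follows, reproducing the ID-triples of Figure 4.4 (plain Python lists stand in for the compact arrays of a real implementation):

```python
# Sketch: building BitmapTriples (Sp, Bp, So, Bo) from SPO-sorted ID-triples.
from itertools import groupby

def bitmap_triples(triples):
    triples = sorted(triples)                     # SPO order
    sp, bp, so, bo = [], [], [], []
    for _, subj_triples in groupby(triples, key=lambda t: t[0]):
        first_pred = True
        for (_, p), po_triples in groupby(subj_triples, key=lambda t: (t[0], t[1])):
            sp.append(p)
            bp.append(1 if first_pred else 0)     # 1 marks the first predicate of a tree
            first_pred = False
            first_obj = True
            for _, _, o in po_triples:
                so.append(o)
                bo.append(1 if first_obj else 0)  # 1 marks the first object of a path
                first_obj = False
    return sp, bp, so, bo

triples = [(1, 7, 2), (1, 8, 4), (2, 5, 4), (2, 6, 1), (2, 6, 3), (2, 7, 4)]
print(bitmap_triples(triples))
# → ([7, 8, 5, 6, 7], [1, 0, 1, 0, 0], [2, 4, 4, 1, 3, 4], [1, 1, 1, 1, 0, 1])
```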

Querying HDT-Encoded Data Sets: HDT-FoQ


An HDT-encoded data set can be directly accessed once its components are
loaded into the memory hierarchy of any computational system. Nevertheless,
this can be tuned carefully by considering the volume of the data sets and
the retrieval velocity needed by specific applications. Thus, we require a data
structure that keeps the compactness of the encoding to load data at the higher
levels of the memory hierarchy. Data in faster memory always means faster
retrieval operations. We call this solution HDT-FoQ: HDT Focused on Querying.
Dictionary. The dictionary component can be directly mapped from the encoding into memory because it embeds enough information to resolve the basic operations previously described. Thus, this compo-
nent follows the idea of “the data are the index.” We invite interested readers
to review the paper of Brisaboa et al. (2011) for a more detailed description of
how dictionaries provide indexing capabilities.
Triples. The previously described BitmapTriples approach is easy to map
due to the simplicity of its encoding. Sequences Sp and So are loaded into two
integer arrays using, respectively, log(|P|) and log(|O|) bits per element. Bit
sequences can also be mapped directly, but in this case they are enhanced
with an additional small structure (González et al. 2005) that ensures con-
stant time resolution for some basic bit operations.
This simple idea enables efficient traversal of the Triples component. All
these algorithms are described in Martínez-Prieto et  al. (2012a), but we
review them in practice over the example in Figure 4.4. Let us suppose that
we ask for the existence of the triple (2, 6, 1). It implies that the retrieval
operation is performed over the second tree:

1. We retrieve the corresponding predicate list. It is the 2nd one in Sp, and it is found by simply locating the position of the second 1 bit in Bp. In this case, P2 = 3, so the predicate list comprises all elements from Sp[3] until the end (because this is the last 1 bit in Bp). Thus, the predicate list is {5, 6, 7}.
2. The predicate 6 is searched in this list. We binary search it and find that it is the second element in the list. Thus, it is at position P2 + 2 − 1 = 3 + 2 − 1 = 4 in Sp, so we are traversing the 4th path of the forest.
3. We retrieve the corresponding object list. It is the 4th one in So. We obtain it as before: first locate the fourth 1 bit in Bo, O4 = 4, and then retrieve all objects until the next 1 bit. That is, the list comprises the objects {1, 3}.
4. Finally, the object list is binary searched, and the object 1 is located in its first position. Thus, we are sure that the triple (2, 6, 1) exists in the data set.

All triple patterns providing the subject are efficiently resolved with variants of this process. Thus, the data structure directly mapped from the encoding provides fast subject-based retrieval, but makes access by predicate or object difficult. Both can easily be accomplished with a limited overhead on the space used by the original encoding. All fine-grain details about the following decisions are also explained in Martínez-Prieto et al. (2012a).
Enabling access by predicate. This retrieval operation demands direct access to the second level of the tree, that is, efficient access to the sequence Sp. However, the elements of Sp are sorted by subject, so locating all occurrences of a predicate demands a full scan of this sequence, which results in poor response time.
Although accesses by predicate are uncommon in general (Arias et al. 2011), some applications could require them (e.g., extracting all the information described with a given set of predicates). Thus, we address this need with another data structure for mapping Sp. It must enable efficient predicate location without degrading basic access, because Sp is used in all operations by subject. We choose a structure called a wavelet tree. The wavelet tree (Grossi et al. 2003) is a succinct structure that reorganizes a sequence of integers, in a range [1, n], to provide some access operations over the data in logarithmic time. Thus, the original Sp is now loaded as a wavelet tree, not as an array. This implies a limited additional cost (in space) which preserves HDT scalability for managing Big Semantic Data. In return, we can locate all occurrences of a predicate in time logarithmic in the number of different predicates used in the data set. In practice, this number is small, which yields efficient occurrence location within our access operations. It is worth noting that accessing any position in the wavelet tree now also has a logarithmic cost.
Therefore, access by predicate is implemented by first locating the predicate occurrences one by one and then, for each occurrence, traversing the tree with steps comparable to those explained in the previous example.
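The occurrence-location step can be sketched as follows (a plain positional index stands in for the wavelet tree's select operation; it returns the same positions, though without the succinct-space and logarithmic-time guarantees):

```python
# Sketch: locating all triples with a given predicate. A real HDT-FoQ
# implementation obtains the occurrence positions from a wavelet tree over Sp;
# here a plain positional index plays that role for clarity.

Sp, Bp = [7, 8, 5, 6, 7], [1, 0, 1, 0, 0]

# 1-based positions of each predicate in Sp (what the wavelet tree's select gives).
occurrences = {}
for pos, pred in enumerate(Sp, start=1):
    occurrences.setdefault(pred, []).append(pos)

def subject_of(path_pos):
    """Recover the subject of a path: count 1 bits in Bp up to that position (rank1)."""
    return sum(Bp[:path_pos])

print([(subject_of(pos), 7) for pos in occurrences[7]])  # (subject, predicate) pairs for predicate 7
```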

Enabling access by object. The data structure designed for loading HDT-encoded data sets, considering a subject-based order, is not suitable for accesses by object. All the occurrences of an object are scattered throughout the sequence So, and we are not able to locate them without a sequential scan. Furthermore, in this case a structure like the wavelet tree becomes inefficient; RDF data sets usually have few predicates, but they contain many different objects, and logarithmic costs would result in very expensive operations.
We enhance HDT-FoQ with an additional index (called O-Index) that is responsible for resolving accesses by object. This index basically gathers the positions where each object appears in the original So. Note that each leaf is associated with a different triple, so given the index of an element in the lower level, we can recover the associated predicate and subject by traversing the tree upwards, processing the bit sequences in a way similar to that used for subject-based access.
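Over the same Figure 4.4 sequences, the O-Index and the upward traversal can be sketched as (plain lists again stand in for the succinct structures):

```python
# Sketch: an O-Index mapping each object to its leaf positions in So, plus the
# upward traversal recovering the (subject, predicate) of each leaf.

Sp, Bp = [7, 8, 5, 6, 7], [1, 0, 1, 0, 0]
So, Bo = [2, 4, 4, 1, 3, 4], [1, 1, 1, 1, 0, 1]

o_index = {}
for pos, obj in enumerate(So, start=1):
    o_index.setdefault(obj, []).append(pos)

def triple_at_leaf(leaf_pos):
    """Recover the full triple of a leaf by walking the bit sequences upwards."""
    path = sum(Bo[:leaf_pos])      # rank1(Bo): which (subject, predicate) path
    subject = sum(Bp[:path])       # rank1(Bp): which subject tree
    return (subject, Sp[path - 1], So[leaf_pos - 1])

print([triple_at_leaf(pos) for pos in o_index[4]])  # all triples with object 4
```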
In relative terms, this O-Index has a significant impact on the final HDT-FoQ requirements because it takes considerable space in comparison to the
other data structures used for modeling the Triples component. However, in
absolute terms, the total size required by HDT-FoQ is very small in compari-
son to that required by the other competitive solutions in the state of the art.
All these results are analyzed in the next section.
Joining Basic Triple Patterns. All this infrastructure enables basic triple patterns
to be resolved, in compressed space, at the higher levels of the memory
hierarchy. As we show below, it guarantees efficient triple pattern resolution.
Although these kinds of queries are massively used in practice (Arias et al.
2011), the SPARQL core is defined around the concept of Basic Graph Pattern
(BGP) and its semantics for building conjunctions, disjunctions, and optional parts
involving more than a single triple pattern. Thus, HDT-FoQ must provide more
advanced query resolution to reach full SPARQL coverage. At this moment, it
is able to resolve conjunctive queries by using specific implementations of the
well-known merge and index join algorithms (Ramakrishnan and Gehrke 2000).
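As an illustration, the classical merge join can be sketched over two pattern result lists sorted by the join variable; this is a textbook version with hypothetical bindings, not the actual HDT-FoQ implementation:

```python
def merge_join(left, right):
    """Merge join of two lists of (key, value) pairs sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs sharing this key.
            i0 = i
            while i < len(left) and left[i][0] == lk:
                i += 1
            j0 = j
            while j < len(right) and right[j][0] == lk:
                j += 1
            for a in left[i0:i]:
                for b in right[j0:j]:
                    out.append((lk, a[1], b[1]))
    return out

# Join on ?s between two patterns, e.g., (?s, p1, ?o1) and (?s, p2, ?o2).
left = [(1, 'a'), (2, 'b'), (2, 'c')]
right = [(2, 'x'), (3, 'y')]
print(merge_join(left, right))  # [(2, 'b', 'x'), (2, 'c', 'x')]
```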

Experimental Results
This section analyzes the impact of HDT for encoding Big Semantic Data
within the Publication-Exchange-Consumption workflow described in the
Web of Data. We characterize the publisher and consumer stakeholders of
our experiments as follows:

• The publisher is devised as an efficient agent with a powerful computational configuration. It runs on an Intel Xeon [email protected] GHz, hexa-core (6 cores, 12 siblings: 2 threads per core), with 96 GB DDR3@1066 MHz.
Management of Big Semantic Data 157

• The consumer is devised with a conventional configuration because it plays the role of any agent consuming RDF within the Web of Data. It runs on an AMD Phenom II X4 [email protected] GHz, quad-core (4 cores, 4 siblings: 1 thread per core), with 8 GB DDR2@800 MHz.

The network is regarded as an ideal communication channel: free of errors
and of any other external interference. We assume a transmission speed of
2 Mbyte/s.
All our experiments are carried out over a heterogeneous data configuration.
We choose a variety of real-world semantic data sets of different sizes
and from different application domains (see Table 4.1). In addition, we join
the three biggest data sets into a large mash-up of more than 1 billion triples
to analyze performance issues in an integrated data set.
The prototype running these experiments is developed in C++ using the
HDT library publicly available at the official RDF/HDT website.*

Publication Performance
As explained, RDF data sets are usually released in plain-text form (NTriples,
Turtle, or RDF/XML), and their big volume is simply reduced using a traditional
compressor. Thus, volume directly affects the publication process
because the publisher must, at least, process the data set to convert it into a
format suitable for exchange. Following current practices, we set gzip compression
as the baseline, and we also include lzma because of its effectiveness.
We compare their results against HDT, both plain and in conjunction with
the same compressors. That is, HDT plain implements the encoding described
in section "Encoding Big Semantic Data: HDT", and HDT + X stands for the
result of compressing HDT plain with compressor X.

Table 4.1
Statistics of the Real-World Data Sets Used in the Experimentation

Data set     NTriples       Plain Size (GB)  Available at
LinkedMDB    6,148,121      0.85             http://queens.db.toronto.edu/~oktie/linkedmdb
DBLP         73,226,756     11.16            http://DBLP.l3s.de/DBLP++.php
Geonames     119,316,724    13.79            http://download.Geonames.org/all-Geonames-rdf.zip
DBpedia      296,907,301    48.62            http://wiki.dbpedia.org/Downloads37
Freebase     639,078,932    84.76            http://download.freebase.com/datadumps/ (a)
Mashup       1,055,302,957  140.46           Mashup of Geonames + Freebase + DBPedia

(a) Dump on 2012-07-26, converted to RDF using http://code.google.com/p/freebase-quad-rdfize/

* http://www.rdfhdt.org

Figure 4.5 shows compression ratios for all the considered techniques.
In general, HDT plain requires more space than traditional compressors.
This is an expected result because both Dictionary and Triples use very basic
approaches. Advanced techniques for each component enable significant
improvements in space. For instance, our preliminary results using
the technique proposed in Martínez-Prieto et al. (2012b) for dictionary
encoding show a significant improvement. Nevertheless, if we
apply traditional compression over the HDT-encoded data sets, the spatial
requirements are largely diminished. As shown in Figure 4.5, the comparison
changes when the HDT-encoded data sets are compressed with
gzip and lzma. These results show that HDT + lzma achieves the most
compressed representations, largely improving on the effectiveness reported
by traditional approaches. For instance, HDT + lzma only uses 2.56% of
the original mash-up size, whereas plain compressors require 5.23% (lzma) and
7.92% (gzip).
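These percentages are simply the compressed size over the plain NTriples size. For instance, taking the mash-up's 140.46 GB (Table 4.1), a compressed size of about 3.6 GB (back-computed from the reported ratio, not a measured figure) yields the 2.56% above:

```python
def compression_ratio(compressed, original):
    """Compressed size as a percentage of the original size (same units)."""
    return 100.0 * compressed / original

# Illustrative sizes in GB; 3.6 GB is back-computed from the 2.56% figure.
print(round(compression_ratio(3.6, 140.46), 2))  # 2.56
```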
Thus, encoding the original Big Semantic Data with HDT and then applying
compression reports the best numbers for publication. It means that
publishers using our approach require 2−3 times less storage space and
bandwidth than using traditional compression. These savings are achieved
at the price of spending some time to obtain the corresponding representations.
Note that traditional compression basically requires compressing the
data set, whereas our approach first transforms the data set into its HDT
encoding and then compresses it. These publication times (in minutes) are
reported in Table 4.2.

Compression ratio (%)

Data set     HDT + lzma  HDT + gz  NT + lzma  NT + gz  HDT
LinkedMDB    1.43        1.81      2.20       4.46     6.34
DBLP         1.56        2.01      2.87       4.10     6.68
Geonames     1.68        2.16      2.47       4.97     8.33
DBPedia      3.34        4.32      6.67       9.73     11.25
Freebase     2.06        2.63      4.40       6.65     6.24
Mashup       2.56        3.32      5.23       7.92     8.70

Figure 4.5
Data set compression (expressed as percent of the original size in NTriples).

Table 4.2
Publication Times (Minutes)

Data set    gzip   lzma     HDT + gzip  HDT + lzma
LinkedMDB   0.19   14.71    1.09        1.52
DBLP        2.72   103.53   13.48       21.99
Geonames    3.28   244.72   26.42       38.96
DBPedia     18.90  664.54   84.61       174.12
Freebase    24.08  1154.02  235.83      315.34
Mash-up     47.23  2081.07  861.87      1033.0

Note: Bold values emphasize the best compression times.

As can be seen, direct publication, based on gzip compression, is up to 20
times faster than HDT + gzip. The difference is slightly higher compared to
HDT + lzma, but this choice largely outperforms direct lzma compression.
However, this comparison must be analyzed carefully because publication is
a batch process performed only once per data set, whereas exchange
and postprocessing costs are paid each time any consumer retrieves the
data set. Thus, in practical terms, publishers will prioritize compression over
publication time because of: (i) the storage and bandwidth savings and (ii) the
overall time that consumers wait when they retrieve the data set.

Exchange Performance
In the ideal network regarded in our experiments, exchange performance is
determined solely by the data size. Thus, our approach also emerges as
the most efficient because of its excellent compression ratios. Table 4.3 organizes
processing times for all data sets and each task involved in the workflow.
The Exchange column lists the exchange times required when lzma (in the
baseline) and HDT + lzma are used for encoding.
For instance, the mash-up exchange takes roughly half an hour for
HDT + lzma and slightly more than 1 h for lzma. Thus, our approach halves
the exchange time and saves bandwidth in the same proportion
for the mash-up.
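Under the assumed ideal channel, these exchange times are just size divided by bandwidth. A sketch of the estimate, with compressed sizes in GB back-computed from Table 4.3 (illustrative, not measured):

```python
def exchange_seconds(size_gbytes, bandwidth_mbytes_per_s=2.0):
    """Transfer time over the ideal 2 Mbyte/s channel of the experiments."""
    return size_gbytes * 1024 / bandwidth_mbytes_per_s

hdt_lzma = exchange_seconds(3.6)    # mash-up encoded as HDT + lzma
nt_lzma = exchange_seconds(7.34)    # mash-up encoded as NTriples + lzma
print(round(hdt_lzma), round(nt_lzma))  # 1843 3758
```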

Consumption Performance
In the current evaluation, consumption performance is analyzed from two com-
plementary perspectives. First, we consider a postprocessing stage in which the
consumer decompresses the downloaded data set and then indexes it for local
consumption. Every consumption task directly relies on efficient query resolu-
tion, and thus, our second evaluation focuses on query evaluation performance.

Table 4.3
Overall Client Times (Seconds)

Data set    Config.   Exchange  Decomp.  Index     Total
LinkedMDB   Baseline  9.61      5.11     111.08    125.80
            HDT       6.25      1.05     1.91      9.21
DBLP        Baseline  164.09    70.86    1387.29   1622.24
            HDT       89.35     14.82    16.79     120.96
Geonames    Baseline  174.46    87.51    2691.66   2953.63
            HDT       118.29    19.91    44.98     183.18
DBPedia     Baseline  1659.95   553.43   7904.73   10118.11
            HDT       832.35    197.62   129.46    1159.43
Freebase    Baseline  1910.86   681.12   58080.09  60672.07
            HDT       891.90    227.47   286.25    1405.62
Mashup      Baseline  3757.92   1238.36  >24 h     >24 h
            HDT       1839.61   424.32   473.64    2737.57

Note: Bold values highlight the best times for each activity in the workflow. Baseline means
that the file is downloaded in NTriples format, compressed using lzma, and indexed
using RDF-3X. HDT means that the file is downloaded in HDT, compressed with lzma,
and indexed using HDT-FoQ.

Both postprocessing and querying tasks require an RDF store providing
indexing and efficient SPARQL resolution. We choose three well-known
stores for a fair comparison with HDT-FoQ:

• RDF3X* was recently reported as the fastest RDF store (Huang et al. 2011).
• Virtuoso† is a popular store built on a relational infrastructure.
• Hexastore‡ is a well-known memory-resident store.

Postprocessing. As stated, this task involves decompression and indexing
in order to make the compressed data set retrieved from the publisher
queryable. Table 4.3 also organizes postprocessing times for all data sets. It is
worth noting that we compare our HDT + lzma against a baseline comprising
lzma decompression and RDF3X indexing because this baseline reports the best
numbers. Cells containing ">24 h" mean that the process had not finished
after 24 h. Thus, indexing the mash-up in our consumer is a very heavy task,
requiring a lot of computational resources and also a lot of time.
HDT-based postprocessing largely outperforms RDF3X for all original
data sets in our setup. HDT performs decompression and indexing from ≈ 25
(DBPedia) to 114 (Freebase) times faster than RDF3X. This situation is due to

* RDF3X is available at http://www.mpi-inf.mpg.de/~neumann/rdf3x/


† Virtuoso is available at http://www.openlinksw.com/
‡ Hexastore has been kindly provided by the authors.

two main reasons. On the one hand, HDT-encoded data sets are smaller than
their counterparts in NTriples, which improves decompression performance.
On the other hand, HDT-FoQ generates its additional indexing structures
(see section "Querying HDT-Encoded Data sets: HDT-FoQ") over the original
HDT encoding, whereas RDF3X first needs to parse the data set and then
build its specific indices from scratch. Both features share an important
fact: the most expensive processing was already done on the server side, and
HDT-encoded data sets are clearly better suited for machine consumption.
Exchange and postprocessing times can be analyzed together, yielding
the total time that a consumer must wait until the data can be
efficiently used in any application. Our integrated approach, built around HDT
encoding and data structures, completes all the tasks 8−43 times faster than
the traditional combination of compression and RDF indexing. This means,
for instance, that the configured consumer retrieves and makes Freebase
queryable in roughly 23 min using HDT, but needs almost 17 h to complete
the same process over the baseline. In addition, we can see that indexing is
clearly the heaviest task in the baseline, whereas exchange is the longest task
for us. However, in any case, we always complete exchange faster thanks to our
achievements in space.
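The upper end of this range can be checked directly against the Freebase row of Table 4.3:

```python
# Freebase row of Table 4.3 (seconds): exchange + decompression + indexing.
baseline_total = 1910.86 + 681.12 + 58080.09   # lzma + RDF3X: almost 17 h
hdt_total = 891.90 + 227.47 + 286.25           # HDT + lzma + HDT-FoQ: ~23 min
print(round(baseline_total / hdt_total, 1))  # 43.2
```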
Querying. Once the consumer has made the downloaded data queryable,
the infrastructure is ready to build on-top applications issuing SPARQL queries.
The data volume emerges again as a key factor because it restricts the
ways indices and query optimizers are designed and managed.
On the one hand, RDF3X and Virtuoso rely on disk-based indexes that
are selectively loaded into main memory. Although both are efficiently tuned
for this purpose, the resulting I/O transfers are very expensive operations that
hinder the final querying performance. On the other hand, Hexastore and
HDT-FoQ always hold their indices in memory, avoiding these slow accesses
to disk. Whereas HDT-FoQ manages all the data sets in the setup under
the consumer configuration, Hexastore is only able to index the smallest
one, showing its scalability problems when managing Big Semantic Data.
We use two different sets of SPARQL queries to compare HDT-FoQ
against the indexing solutions within the state of the art. On the one hand,
5000 queries are randomly generated for each triple pattern. On the other
hand, we also generate 350 queries of each type of two-way join, subdivided
into two groups depending on whether they have a small or a big amount of
intermediate results. All these queries are run over Geonames in order to
include both Virtuoso and RDF3X in the experiments. Note that both classes
of queries are resolved without the need for query planning; hence, the results
are clear evidence of how the different indexing techniques perform.
Figure 4.6 summarizes these querying experiments. The X-axis lists all the
different queries: the left subgroup contains the triple patterns, and the right one
contains the different join classes. The Y-axis gives the number of times that
HDT-FoQ is faster than its competitors. For instance, for the pattern (S, V, V)
(equivalent to dereferencing the subject S), HDT-FoQ is more than 3 times

[Figure 4.6: bar chart of query evaluation times. The X-axis lists the query types (triple patterns spV, sVo, sVV, Vpo, VpV, VVo and join classes SSbig, SSsmall, SObig, SOsmall, OObig, OOsmall); the Y-axis gives the number of times HDT-FoQ is faster than RDF-3X and Virtuoso.]

Figure 4.6
Comparison of querying performance on Geonames.

faster than RDF3X and more than 11 times faster than Virtuoso. In general,
HDT-FoQ always outperforms Virtuoso, whereas RDF3X is slightly faster for
(V, P, V) and some join classes. Nevertheless, we remain competitive in all
these cases, and our join algorithms are still open to optimization.

Conclusions and Next Steps


This chapter presents the basic foundations of Big Semantic Data management.
First, we trace a route from the current data deluge, through the concept of Big
Data, to the need for machine-processable semantics on the WWW. The Resource
Description Framework (RDF) and the Web of (Linked) Data naturally
emerge in this well-grounded scenario. The former, RDF, is the natural codification
language for semantic data, combining the flexibility of semantic
networks with a graph data structure, which makes it an excellent choice for
describing metadata at Web scale. The latter, the Web of (Linked) Data, provides
a set of rules to publish and link Big Semantic Data.
We motivate the varied management problems arising in Big Semantic
Data by characterizing their main stakeholders by role (Creators/
Publishers/Consumers) and nature (Automatic/Supervised/Human). Then,
we define a common Publication-Exchange-Consumption workflow, present
in most applications in the Web of Data. The scalability problems that this
scenario poses for current state-of-the-art management solutions set
the basis of our integrated proposal, HDT, which builds on the W3C standard RDF.

HDT is designed as a binary RDF format that fulfills the requirements of
portability (from and to other formats), compactness, parsing efficiency
(readiness for postprocessing), and direct access to any piece of data in the
data set. We detail the design of HDT and argue that HDT-encoded data
sets can be directly consumed within the presented workflow. We show that
lightweight indices can be created once the different components are loaded
into the memory hierarchy at the consumer, allowing for more complex operations
such as joining basic SPARQL triple patterns. Finally, this compact
infrastructure, called HDT-FoQ (HDT Focused on Querying), is evaluated
against a traditional combination of universal compression (for exchanging)
and RDF indexing (for consumption).
Our experiments show how HDT excels at almost every stage of the
publish-exchange-consume workflow. The publisher spends a bit more
time encoding the Big Semantic data set, but in return, the consumer
retrieves it twice as fast, and the indexing time is reduced to
just a few minutes even for huge data sets. Therefore, the time from the moment
a machine or human client discovers the data set until it is ready to start querying
its content is reduced by up to 16 times by using HDT instead of the traditional
approaches. Furthermore, query performance is very competitive
with state-of-the-art RDF stores; thanks to the size reduction,
the machine can keep a vast amount of triples in main memory, avoiding
slow I/O transfers.
There are several areas where HDT can be further exploited. We foresee
a huge potential for HDT to support many aspects of the Publish-
Exchange-Consume workflow. HDT-based technologies can emerge to provide
supporting tools for both publishers and consumers. For instance, a very useful
tool for a publisher is setting up a SPARQL endpoint on top of an HDT file.
As the experiments show, HDT-FoQ is very competitive on queries, but there
is still plenty of room for SPARQL optimization by leveraging efficient resolution
of triple patterns, joins, and query planning. Another useful tool for
publishers is configuring a dereferenceable URI materialization from a given
HDT file. Here the experiments also show that performance will be very high
because HDT-FoQ is really fast on queries with a fixed RDF subject.

Acknowledgments
This work was partially funded by MICINN (TIN2009-14009-C02-02); Science
Foundation Ireland: Grant No. SFI/08/CE/I1380 (Lion-II); Fondecyt 1110287;
and Fondecyt 1-110066. The first author is supported by Erasmus Mundus, the
Regional Government of Castilla y León (Spain), and the European Social
Fund. The third author is supported by the University of Valladolid's programme
of Mobility Grants for Researchers (2012).

References
Abadi, D., A. Marcus, S. Madden, and K. Hollenbach. 2009. SW-Store: A vertically
partitioned DBMS for Semantic Web data management. The VLDB Journal 18,
385–406.
Adida, B., I. Herman, M. Sporny, and M. Birbeck (Eds.). 2012. RDFa 1.1 Primer. W3C
Working Group Note. http://www.w3.org/TR/xhtml-rdfa-primer/.
Akar, Z., T. G. Hala, E. E. Ekinci, and O. Dikenelli. 2012. Querying the Web of
Interlinked Datasets using VOID Descriptions. In Proc. of the Linked Data on the
Web Workshop (LDOW), Lyon, France, Paper 6.
Alexander, K. 2008. RDF in JSON: A Specification for serialising RDF in JSON. In
Proc. of the 4th Workshop on Scripting for the Semantic Web (SFSW), Tenerife,
Spain.
Alexander, K., R. Cyganiak, M. Hausenblas, and J. Zhao. 2009. Describing linked
datasets-on the design and usage of voiD, the “vocabulary of interlinked data-
sets”. In Proc. of the Linked Data on the Web Workshop (LDOW), Madrid, Spain,
Paper 20.
Álvarez-García, S., N. Brisaboa, J. Fernández, and M. Martínez-Prieto. 2011.
Compressed k2-triples for full-in-memory RDF engines. In Proc. 17th Americas
Conference on Information Systems (AMCIS), Detroit, Mich, Paper 350.
Arenas, M., A. Bertails, E. Prud’hommeaux, and J. Sequeda (Eds.). 2012. A Direct
Mapping of Relational Data to RDF. W3C Recommendation. http://www.
w3.org/TR/rdb-direct-mapping/.
Arias, M., J. D. Fernández, and M. A. Martínez-Prieto. 2011. An empirical study of
real-world SPARQL queries. In Proc. of 1st Workshop on Usage Analysis and the Web
of Data (USEWOD), Hyderabad, India. http://arxiv.org/abs/1103.5043.
Atemezing, G., O. Corcho, D. Garijo, J. Mora, M. Poveda-Villalón, P. Rozas, D. Vila-
Suero, and B. Villazón-Terrazas. 2013. Transforming meteorological data into
linked data. Semantic Web Journal 4(3), 285–290.
Atre, M., V. Chaoji, M. Zaki, and J. Hendler. 2010. Matrix “Bit” loaded: A scalable
lightweight join query processor for RDF data. In Proc. of the 19th World Wide
Web Conference (WWW), Raleigh, NC, pp. 41–50.
Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives. 2007. Dbpedia: A nucleus
for a web of open data. In Proc. of the 6th International Semantic Web Conference
(ISWC), Busan, Korea, pp. 11–15.
Baeza-Yates, R. and B. A. Ribeiro-Neto. 2011. Modern Information Retrieval—the
Concepts and Technology Behind Search (2nd edn.). Pearson Education Ltd.
Beckett, D. (Ed.) 2004. RDF/XML Syntax Specification (Revised). W3C Recommendation.
http://www.w3.org/TR/rdf-syntax-grammar/.
Beckett, D. and T. Berners-Lee. 2008. Turtle—Terse RDF Triple Language. W3C Team
Submission. http://www.w3.org/TeamSubmission/turtle/.
Berners-Lee, T. 1998. Notation3. W3C Design Issues. http://www.w3.org/
DesignIssues/Notation3.
Berners-Lee, T. 2002. Linked Open Data. What is the idea? http://www.thenationaldialogue.org/ideas/linked-open-data (accessed October 8, 2012).
Berners-Lee, T. 2006. Linked Data: Design Issues. http://www.w3.org/DesignIssues/
LinkedData.html (accessed October 8, 2012).

Bizer, C., T. Heath, and T. Berners-Lee. 2009. Linked data—the story so far. International
Journal on Semantic Web and Information Systems 5, 1–22.
Brickley, D. 2004. RDF Vocabulary Description Language 1.0: RDF Schema. W3C
Recommendation. http://www.w3.org/TR/rdf-schema/.
Brisaboa, N., R. Cánovas, F. Claude, M. Martínez-Prieto, and G. Navarro. 2011.
Compressed string dictionaries. In Proc. of 10th International Symposium on
Experimental Algorithms (SEA), Chania, Greece, pp. 136–147.
Cukier, K. 2010. Data, data everywhere. The Economist (February, 25). http://www.
economist.com/opinion/displaystory.cfm?story_id=15557443 (accessed October
8, 2012).
Cyganiak, R., H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello. 2008. Semantic
sitemaps: Efficient and flexible access to datasets on the semantic web. In Proc.
of the 5th European Semantic Web Conference (ESWC), Tenerife, Spain, pp. 690–704.
De, S., T. Elsaleh, P. M. Barnaghi, and S. Meissner. 2012. An internet of things platform
for real-world and digital objects. Scalable Computing: Practice and Experience
13(1), 45–57.
Dijcks, J.-P. 2012. Big Data for the Enterprise. Oracle (white paper) (January). http://
www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf
(accessed October 8, 2012).
Dumbill, E. 2012a. Planning for Big Data. O’Reilly Media, Sebastopol, CA.
Dumbill, E. 2012b. What is big data? Strata (January, 11). http://strata.oreilly.
com/2012/01/what-is-big-data.html (accessed October 8, 2012).
Fernández, J. D., M. A. Martínez-Prieto, and C. Gutiérrez. 2010. Compact represen-
tation of large RDF data sets for publishing and exchange. In Proc. of the 9th
International Semantic Web Conference (ISWC), Shangai, China, pp. 193–208.
Fernández, J. D., M. A. Martínez-Prieto, C. Gutiérrez, and A. Polleres. 2011. Binary RDF
Representation for Publication and Exchange (HDT). W3C Member Submission.
http://www.w3.org/Submission/2011/03/.
Foulonneau, M. 2011. Smart semantic content for the future internet. In Metadata and
Semantic Research, Volume 240 of Communications in Computer and Information
Science, pp. 145–154. Springer, Berlin, Heidelberg.
García-Silva, A., O. Corcho, H. Alani, and A. Gómez-Pérez. 2012. Review of the state
of the art: Discovering and associating semantics to tags in folksonomies. The
Knowledge Engineering Review 27(01), 57–85.
González, R., S. Grabowski, V. Mäkinen, and G. Navarro. 2005. Practical implementa-
tion of rank and select queries. In Proc. of 4th International Workshop Experimental
and Efficient Algorithms (WEA), Santorini Island, Greece, pp. 27–38.
Grossi, R., A. Gupta, and J. Vitter. 2003. High-order entropy-compressed text indexes.
In Proc. of 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA),
Baltimore, MD, pp. 841–850.
Haas, K., P. Mika, P. Tarjan, and R. Blanco. 2011. Enhanced results for web search. In
Proc. of the 34th International Conference on Research and Development in Information
Retrieval (SIGIR), Beijing, China, pp. 725–734.
Halfon, A. 2012. Handling big data variety. http://www.finextra.com/community/fullblog.aspx?blogid=6129 (accessed October 8, 2012).
Hausenblas, M. and M. Karnstedt. 2010. Understanding linked open data as a web-
scale database. In Proc. of the 1st International Conference on Advances in Databases
(DBKDA), 56–61.

Heath, T. and C. Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space.
Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan &
Claypool.
Hey, T., S. Tansley, and K. M. Tolle. 2009. Jim Gray on eScience: A transformed scien-
tific method. In The Fourth Paradigm. Microsoft Research.
Hogan, A., A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. 2011. Searching
and browsing linked data with SWSE: The semantic web search engine. Journal
of Web Semantics 9(4), 365–401.
Hogan, A., J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. 2012. An
empirical survey of linked data conformance. Web Semantics: Science, Services
and Agents on the World Wide Web 14(0), 14–44.
Huang, J., D. Abadi, and K. Ren. 2011. Scalable SPARQL querying of large RDF
graphs. Proceedings of the VLDB Endowment 4(11), 1123–1134.
Knoblock, C. A., P. Szekely, J. L. Ambite, S. Gupta, A. Goel, M. Muslea, K. Lerman,
and P. Mallick. 2012. Semi-Automatically Mapping Structured Sources into
the Semantic Web. In Proc. of the 9th Extended Semantic Web Conference (ESWC),
Heraklion, Greece, pp. 375–390.
Le-Phuoc, D., J. X. Parreira, V. Reynolds, and M. Hauswirth. 2010. RDF On the Go:
An RDF Storage and Query Processor for Mobile Devices. In Proc. of the 9th
International Semantic Web Conference (ISWC), Shangai, China. http://ceur-ws.
org/Vol-658/paper503.pdf.
Lohr, S. 2012. The age of big data. The New York Times (February, 11). http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html (accessed October 8, 2012).
Loukides, M. 2012. What is Data Science? O’Reilly Media.
Manola, F. and E. Miller (Eds.). 2004. RDF Primer. W3C Recommendation. www.
w3.org/TR/rdf-primer/.
Martínez-Prieto, M., M. Arias, and J. Fernández. 2012a. Exchange and consumption
of huge RDF data. In Proc. of the 9th Extended Semantic Web Conference (ESWC),
Heraklion, Greece, pp. 437–452.
Martínez-Prieto, M., J. Fernández, and R. Cánovas. 2012b. Querying RDF dictionar-
ies in compressed space. ACM SIGAPP Applied Computing Reviews 12(2), 64–77.
Marz, N. and J. Warren. 2013. Big Data: Principles and Best Practices of Scalable Realtime
Data Systems. Manning Publications.
McGuinness, D. L. and F. van Harmelen (Eds.). 2004. OWL Web Ontology Language
Overview. W3C Recommendation. http://www.w3.org/TR/owl-features/.
Neumann, T. and G. Weikum. 2010. The RDF-3X engine for scalable management of
RDF data. The VLDB Journal 19(1), 91–113.
Prud’hommeaux, E. and A. Seaborne (Eds.). 2008. SPARQL Query Language for RDF.
http://www.w3.org/TR/rdf-sparql-query/. W3C Recommendation.
Quesada, J. 2008. Human similarity theories for the semantic web. In Proceedings of
the First International Workshop on Nature Inspired Reasoning for the Semantic Web,
Karlsruhe, Germany.
Ramakrishnan, R. and J. Gehrke. 2000. Database Management Systems. Osborne/
McGraw-Hill.
Schmidt, M., M. Meier, and G. Lausen. 2010. Foundations of SPARQL query opti-
mization. In Proc. of the 13th International Conference on Database Theory (ICDT),
Lausanne, Switzerland, pp. 4–33.

Schneider, J. and T. Kamiya (Eds.). 2011. Efficient XML Interchange (EXI) Format 1.0.
W3C Recommendation. http://www.w3.org/TR/exi/.
Schwarte, A., P. Haase, K. Hose, R. Schenkel, and M. Schmidt. 2011. FedX: Optimization
techniques for federated query processing on linked data. In Proc. of the 10th
International Conference on the Semantic Web (ISWC), Bonn, Germany, pp. 601–616.
Selg, E. 2012. The next Big Step—Big Data. GFT Technologies AG (technical report). http://www.gft.com/etc/medialib/2009/downloads/techreports/2012.Par.0001.File.tmp/gft_techreport_big_data.pdf (accessed October 8, 2012).
Sidirourgos, L., R. Goncalves, M. Kersten, N. Nes, and S. Manegold. 2008. Column-
store Support for RDF Data Management: not All Swans are White. Proc. of the
VLDB Endowment 1(2), 1553–1563.
Taheriyan, M., C. A. Knoblock, P. Szekely, and J. L. Ambite. 2012. Rapidly integrating
services into the linked data cloud. In Proc. of the 11th International Semantic Web
Conference (ISWC), Boston, MA, pp. 559–574.
Tran, T., G. Ladwig, and S. Rudolph. 2012. Rdf data partitioning and query processing
using structure indexes. IEEE Transactions on Knowledge and Data Engineering 99.
Doi: ieeecomputersociety.org/10.1109/TKDE.2012.134
Tummarello, G., R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, and S. Decker.
2010. Sig.ma: Live views on the web of data. Web Semantics: Science, Services and
Agents on the World Wide Web 8(4), 355–364.
Urbani, J., J. Maassen, and H. Bal. 2010. Massive semantic web data compression with
MapReduce. In Proc. of the 19th International Symposium on High Performance
Distributed Computing (HPDC) 2010, Chicago, IL, pp. 795–802.
Volz, J., C. Bizer, M. Gaedke, and G. Kobilarov. 2009. Discovering and maintaining
links on the web of data. In Proc. of the 9th International Semantic Web Conference
(ISWC), Shanghai, China, pp. 650–665.
Weiss, C., P. Karras, and A. Bernstein. 2008. Hexastore: Sextuple indexing for semantic
web data management. Proc. of the VLDB Endowment 1(1), 1008–1019.
Witten, I. H., A. Moffat, and T. C. Bell. 1999. Managing Gigabytes: Compressing and
Indexing Documents and Images. San Francisco, CA, Morgan Kaufmann.
This page intentionally left blank
5
Linked Data in Enterprise Integration

Sören Auer, Axel-Cyrille Ngonga Ngomo, Philipp Frischmuth, and Jakub Klimek

Contents
Introduction.......................................................................................................... 169
Challenges in Data Integration for Large Enterprises.................................... 173
Linked Data Paradigm for Integrating Enterprise Data................................. 178
Runtime Complexity........................................................................................... 180
Preliminaries.................................................................................................... 181
The HR3 Algorithm........................................................................................ 183
Indexing Scheme......................................................................................... 183
Approach..................................................................................................... 184
Evaluation........................................................................................................ 187
Experimental Setup.................................................................................... 187
Results.......................................................................................................... 188
Discrepancy........................................................................................................... 191
Preliminaries.................................................................................................... 193
CaRLA............................................................................................................... 194
Rule Generation.......................................................................................... 194
Rule Merging and Filtering....................................................................... 195
Rule Falsification........................................................................................ 196
Extension to Active Learning........................................................................ 197
Evaluation........................................................................................................ 198
Experimental Setup.................................................................................... 198
Results and Discussion.............................................................................. 199
Conclusion............................................................................................................ 201
References.............................................................................................................. 202

Introduction
Data integration in large enterprises is a crucial but at the same time a costly,
long-lasting, and challenging problem. While business-critical information
is often already gathered in integrated information systems such as ERP,
CRM, and SCM systems, the integration of these systems themselves as well


as the integration with the abundance of other information sources is still
a major challenge. Large companies often operate hundreds or even
thousands of different information systems and databases. This is especially true
for large OEMs. For example, it is estimated that at Volkswagen there are
approximately 5000 different information systems deployed. At Daimler—
even after a decade of consolidation efforts—the number of independent IT
systems still reaches 3000.
After the arrival and proliferation of IT in large enterprises, various
approaches, techniques, and methods have been introduced in order to solve
the data integration challenge. In the last decade, the prevalent data integra-
tion approaches were primarily based on XML, Web Services, and Service-
Oriented Architectures (SOA) [9]. XML defines a standard syntax for data
representation, Web Services provide data exchange protocols, and SOA is a
holistic approach for distributed systems architecture and communication.
However, it has become increasingly clear that these technologies are not
sufficient to ultimately solve the data integration challenge in large enterprises.
In particular, the overheads associated with SOA are still too high for rapid
and flexible data integration, which is a prerequisite in the dynamic world
of today’s large enterprises.
We argue that classic SOA architectures are well suited for transaction pro-
cessing, but more efficient technologies are available that can be deployed for
solving the data integration challenge. Recent approaches, for example, con-
sider ontology-based data integration, where ontologies are used to describe
data, queries, and mappings between them [33]. The problems of ontology-
based data integration are the required skills to develop the ontologies and
the difficulty to model and capture the dynamics of the enterprise. A related,
but slightly different approach is the use of the Linked Data paradigm for
integrating enterprise data. Similarly, as the data web emerged complement-
ing the document web, data intranets can complement the intranets and SOA
landscapes currently found in large enterprises.
The acquisition of Freebase by Google and Powerset by Microsoft are
the first indicators that large enterprises will not only use the Linked Data
paradigm for the integration of their thousands of distributed information
systems, but they will also aim at establishing Enterprise Knowledge Bases
(EKB; similar to what Freebase now is for Google) as hubs and crystallization
points for the vast amounts of structured data and knowledge distributed in
their data intranets.
Examples of public LOD data sources being highly relevant for large
enterprises are OpenCorporates* (a knowledge base containing information
about more than 50,000 corporations worldwide), LinkedGeoData [1] (a spa-
tial knowledge base derived from OpenStreetMap containing precise infor-
mation about all kinds of spatial features and entities) or Product Ontology†

* http://opencorporates.com/
† http://www.productontology.org/

(which comprises detailed classifications and information about more
than 1 million products). For enterprises, tapping this vast, crowd-sourced
knowledge that is freely available on the web is an amazing opportunity.
However, it is crucial to assess the quality of such freely available knowl-
edge, to complement and contrast it with additional nonpublic information
being available to the enterprise (e.g., enterprise taxonomies, domain data-
bases, etc.) and to actively manage the life cycle of both—the public and
private data—being integrated and made available in an Enterprises data
intranet.
In order to make large enterprises ready for the service economy, their
IT infrastructure landscapes have to be made dramatically more flexible.
Information and data have to be integrated with substantially reduced costs
and in extremely short-time intervals. Mergers and acquisitions further
accelerate the need for making IT systems more interoperable, adaptive, and
flexible. Employing the Linked Data approach for establishing enterprise
data intranets and knowledge bases will facilitate the digital innovation
capabilities of large enterprises.
In this chapter, we explore the challenges large enterprises are still fac-
ing with regard to data integration. These include, but are not limited to,
the development, management, and interlinking of enterprise taxono-
mies, domain databases, wikis, and other enterprise information sources
(cf. “Challenges in Data Integration for Large Enterprises”). Employing
the Linked Data paradigm to address these challenges might result in the
emergence of enterprise Big Data intranets, where thousands of databases
and information systems are connected and interlinked. Only a small part
of the data sources in such an emerging Big Data intranet will actually be
the Big Data itself. Many of them are rather small- or medium-sized data
and knowledge bases. However, due to the large number of such sources,
they will jointly reach a critical mass (volume). Also, we will observe on
a data intranet a large semantic heterogeneity involving various schemas,
vocabularies, ontologies, and taxonomies (variety). Finally, since Linked
Data means directly publishing RDF from the original data representations,
changes in source databases and information systems will be immediately
visible on the data intranet and thus result in a constant evolution (velocity).
Of particular importance in such a Big Data intranet setting is the creation
of links between distributed data and knowledge bases within an enter-
prise’s Big Data intranet. Consequently, we also discuss the requirements
for linking and transforming enterprise data in depth (cf. “Linked Data
Paradigm for Integrating Enterprise Data”). Owing to the number of link-
ing targets to be considered and their size, the time efficiency of linking
is a key issue in Big Data intranets. We thus present and study the com-
plexity of the first reduction-ratio-optimal algorithm for link discovery (cf.
“Runtime Complexity”). Moreover, we present an approach for reducing the
discrepancy (i.e., improving the coherence) of data across knowledge bases
(cf. “Discrepancy”).


Figure 5.1
​Our vision of an Enterprise Data Web (EDW). The solid lines show how IT systems may be
currently connected in a typical scenario. The dotted lines visualize how IT systems could be
interlinked employing an internal data cloud. The EDW also comprises an EKB, which consists
of vocabulary definitions, copies of relevant Linked Open Data, as well as internal and external
link sets between data sets. Data from the LOD cloud may be reused inside the enterprise, but
internal data are secured from external access just like in usual intranets.

The introductory section depicts our vision of an Enterprise Data Web and
the resulting semantically interlinked enterprise IT systems landscape (see
Figure 5.1). We expect existing enterprise taxonomies to be the nucleus of
linking and integration hubs in large enterprises, since these taxonomies
already reflect a large part of the domain terminology and corporate and
organizational culture. In order to transform enterprise taxonomies into
comprehensive EKBs, additional relevant data sets from the Linked Open
Data Web have to be integrated and linked with the internal taxonomies and
knowledge structures. Subsequently, the emerging EKB can be used (1) for
interlinking and annotating content in enterprise wikis, content management
systems, and portals; (2) as a stable set of reusable concepts and identifiers;
and (3) as the background knowledge for intranet, extranet, and site-search
applications. As a result, we expect the current document-oriented intranets
in large enterprises to be complemented with a data intranet, which facili-
tates the lightweight, semantic integration of the plethora of information sys-
tems and databases in large enterprises.

Challenges in Data Integration for Large Enterprises


We identified six crucial areas (Table 5.1) where data integration challenges
arise in large enterprises. Figure 5.2 shows the Linked Data life cycle in con-
junction with the aforementioned challenges. Each challenge may be related
to a ­single or to multiple steps in the Linked Data life cycle.
Enterprise Taxonomies. Nowadays, almost every large enterprise uses tax-
onomies to provide a shared linguistic model aiming at structuring the large
quantities of documents, emails, product descriptions, enterprise directives,
etc. which are produced on a daily basis. Currently, terminology in large
enterprises is managed in a centralized manner mostly by a dedicated and
independently acting department (often referred to as Corporate Language
Management (CLM)). CLM is in charge of standardizing all corporate terms
both for internal and external uses. As a result, they create multiple diction-
aries for different scopes that are not interconnected. An employee who aims
at looking up a certain term needs to know which dictionary to use in that
very context, as well as where to retrieve the currently approved version of
it. The latter may not always be the case, especially for new employees. The
former applies to all employees, since it might be unclear, which dictionary
should be used, resulting in a complicated look-up procedure or worse the

Table 5.1
Overview of Data Integration Challenges Occurring in Large Enterprises
(Information Integration Domain / Current State / Linked Data Benefit)

Enterprise Taxonomies
  Current State: Proprietary, centralized, no relationships between terms, multiple independent terminologies (dictionaries)
  Linked Data Benefit: Open standards (e.g., SKOS), distributed, hierarchical, multilingual, reusable in other scenarios

XML Schema Governance
  Current State: Multitude of XML schemas, no integrated documentation
  Linked Data Benefit: Relationships between entities from different schemas, tracking/documentation of XML schema evolution

Wikis
  Current State: Text-based wikis for teams or internal-use encyclopedias
  Linked Data Benefit: Reuse of (structured) information via data wikis (by other applications), interlinking with other data sources, for example, taxonomies

Web Portal and Intranet Search
  Current State: Keyword search over textual content
  Linked Data Benefit: Sophisticated search mechanisms employing implicit knowledge from different data sources

Database Integration
  Current State: Data warehouses, schema mediation, query federation
  Linked Data Benefit: Lightweight data integration through RDF layer

Enterprise Single Sign-On
  Current State: Consolidated user credentials, centralized SSO
  Linked Data Benefit: No passwords, more sophisticated access control mechanisms (arbitrary metadata attached to identities)

[Figure: the Linked Data life cycle stages—extraction; storage/querying; manual revision/authoring; interlinking/fusing; classification/enrichment; quality analysis; evolution/repair; and search/browsing/exploration—arranged in a cycle, with the integration challenges (enterprise taxonomies, XML schema governance, wikis and portals, web portal and intranet search, database integration, enterprise single sign-on) positioned among them.]

Figure 5.2
The Linked Data life cycle supports the crucial data integration challenges arising in enterprise
environments. Each of the challenges can relate to more than one life-cycle stage.

abandonment of a search at all. As a result, the main challenge in the area of
enterprise taxonomies is the defragmentation of term definitions without
centralization of taxonomy management. We propose to represent enterprise
taxonomies in RDF employing the standardized and widely used SKOS
[17] vocabulary as well as publishing term definitions via the Linked Data
principles.
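To illustrate, a fragment of such a SKOS-based taxonomy can be generated with a few lines of code. The following sketch serializes a small term hierarchy into Turtle; the base URI and the term labels are invented for illustration, and only the SKOS vocabulary itself is standard:

```python
# Minimal sketch: serializing enterprise taxonomy terms as SKOS/Turtle.
# The base URI and all term labels are invented for illustration.

def taxonomy_to_skos(base_uri, terms):
    """terms: dict mapping a term id to (preferred label, broader term id or None)."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for term_id, (label, broader) in terms.items():
        lines.append(f"<{base_uri}{term_id}> a skos:Concept ;")
        # Terminate the concept description here unless a broader term follows.
        lines.append(f'    skos:prefLabel "{label}"@en' + (" ;" if broader else " ."))
        if broader:
            lines.append(f"    skos:broader <{base_uri}{broader}> .")
    return "\n".join(lines)

terms = {
    "vehicle": ("Vehicle", None),
    "passenger-car": ("Passenger car", "vehicle"),
}
print(taxonomy_to_skos("http://example.corp/taxonomy/", terms))
```

The skos:broader relations make the hierarchy between terms explicit and machine-processable, so term definitions published this way can be looked up and interlinked without consulting separate, disconnected dictionaries.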
XML Schema Governance. The majority of enterprises to date use XML for
message exchange, data integration, publishing, and storage, often in the form
of Web Services and XML databases. To be able to process XML documents
efficiently, it needs to be known what kind of data to expect in them. For
this purpose, an XML schema should be provided for each XML format used.
XML schemas describe the allowed structure of an XML document.
Currently, there are four widespread languages for describing XML

schemas: the oldest and the simplest DTD [3], the popular XML Schema
[31], the increasingly used Relax NG [4], and the rule-based Schematron
[12]. In a typical enterprise, there are hundreds or even thousands of XML
schemas in use, each possibly written in a different XML schema language.
Moreover, as the enterprise and its surrounding environment evolve, the
schemas need to adapt. Therefore, new versions of schemas are created,
resulting in a proliferation of XML schemas. XML schema governance now
is the process of bringing order into the large number of XML schemas
being generated and used within large organizations. The sheer number
of IT systems deployed in large enterprises that make use of the XML tech-
nology bear a challenge in bootstrapping and maintaining an XML schema
repository. In order to create such a repository, a bridge between XML sche-
mata and RDF needs to be established. This requires in the first place the
identification of XML schema resources and the respective entities that are
defined by them. Some useful information can be extracted automatically
from XML schema definitions that are available in a machine-readable for-
mat, such as XML schemas and DTDs. While this is probably given for
systems that employ XML for information exchange, it may not always be
the case in proprietary software systems that employ XML only for data
storage. In the latter case as well as for maintaining additional metadata
(such as responsible department, deployed IT systems, etc.), a substantial
amount of manual work is required. In a second step, the identified schema
metadata needs to be represented in RDF on a fine-grained level. The chal-
lenge here is the development of an ontology, which not only allows for the
annotation of XML schemas, but also enables domain experts to establish
semantic relationships between schemas. Another important challenge is to
develop methods for capturing and describing the evolution of XML sche-
mata, since IT systems change over time and those revisions need to be
aligned with the remaining schemas.
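Part of the bootstrapping described above can be automated. The sketch below extracts the elements declared in a machine-readable XML Schema and represents them as simple schema-metadata triples; the schema snippet and the ex:definesElement property are invented for illustration:

```python
# Sketch: bootstrapping schema metadata from a machine-readable XML Schema.
# The schema content and the ex:definesElement property are invented for
# illustration; only the XSD namespace itself is standard.
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

schema_doc = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order"/>
  <xs:element name="customer"/>
</xs:schema>"""

def extract_schema_triples(schema_uri, xsd_source):
    """Emit one metadata triple per element declared in the schema."""
    root = ET.fromstring(xsd_source)
    triples = []
    for elem in root.iter(f"{{{XSD}}}element"):
        name = elem.get("name")
        if name:  # skip element references without a name attribute
            triples.append((schema_uri, "ex:definesElement", name))
    return triples

print(extract_schema_triples("http://example.corp/schemas/order.xsd", schema_doc))
```

Metadata extracted this way would still need to be enriched manually (responsible department, deployed IT systems, etc.), but it provides a starting point for an RDF-based schema repository.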
Wikis. These have become increasingly common over the last years,
ranging from small personal wikis to the largest Internet encyclopedia,
Wikipedia. The same applies to the use of wikis in enterprises [16].
addition to traditional wikis, there is another category of wikis, which are
called semantic wikis. These can again be divided into two categories: seman-
tic text wikis and semantic data wikis. Wikis of this kind are not yet
commonly used in enterprises, but are crucial for enterprise data integration since
they make (at least some of) the information contained in a wiki machine-
accessible. Text-based semantic wikis are conventional wikis (where text is
still the main content type), which allow users to add some semantic annota-
tions to the texts (e.g., typed links). The semantically enriched content can
then be used within the wiki itself (e.g., for dynamically created wiki pages)
or can be queried, when the structured data are stored in a separate data
store. An example is Semantic MediaWiki [14] and its enterprise counterpart
SMW+ [25]. Since wikis in large enterprises are still a quite new phenom-
enon, the deployment of data wikis instead of or in addition to text wikis will

be relatively easy to tackle. A challenge, however, is to train the users of such
wikis to actually create semantically enriched information. For example, the
value of a fact can either be represented as a plain literal or as a relation to
another information resource. In the latter case, the target of the relation can
be identified either by a newly generated URI or one that was introduced
before (eventually already attached with some metadata). The more the users
are urged to reuse information wherever appropriate, the more all the
participants can benefit from the data. It should be part of the design of the
wiki application (especially the user interface), to make it easy for users to
build quality knowledge bases (e.g., through autosuggestion of URIs within
authoring widgets). Data in RDF are represented in the form of simple state-
ments, information that naturally is intended to be stored in conjunction
(e.g., geographical coordinates) is not visible as such per se. The same applies
for information which users are accustomed to edit in a certain order (e.g.,
address data). A nonrational editing workflow, where the end-users are
confronted with a random list of property values, may result in invalid or
incomplete information. The challenge here is to develop a choreography of
authoring widgets in order to provide users with a more logical editing work-
flow. Another challenge to tackle is to make the deployed wiki systems
available to as many stakeholders as possible (i.e., across department boundaries)
to allow for an improved information reuse. Once Linked Data resources
and potentially attached information are reused (e.g., by importing such
data), it becomes crucial to keep them in synchronization with the original
source. Therefore, mechanisms for syndication (i.e., propagation of changes)
and synchronization need to be developed, both for intra- and extranet seman-
tic wiki resources. Finally, it is also necessary to consider access control in
this context. Semantic representations contain implicit information, which
can be revealed by inferencing and reasoning. A challenge is to develop and
deploy scalable access control mechanisms, which are aligned with existing
access control policies in the enterprise and which are safe with regard to the
hijacking of ontologies [10].
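A minimal version of the URI autosuggestion mentioned above can be sketched as follows; an authoring widget would call it on every keystroke to propose existing URIs for reuse instead of letting users mint new ones. The resource URIs and labels are invented for illustration:

```python
# Sketch: prefix-based autosuggestion of known URIs in an authoring widget,
# encouraging users to reuse existing resources instead of minting new ones.
# The URIs and labels are invented for illustration.

known_resources = {
    "http://example.corp/resource/Berlin": "Berlin",
    "http://example.corp/resource/Bergen": "Bergen",
    "http://example.corp/resource/Munich": "Munich",
}

def suggest(prefix, resources, limit=5):
    """Return (label, uri) pairs whose label starts with the typed prefix."""
    matches = [(label, uri) for uri, label in resources.items()
               if label.lower().startswith(prefix.lower())]
    return sorted(matches)[:limit]

print(suggest("ber", known_resources))
```

The more existing URIs are reused through such a mechanism, the better the resulting knowledge base becomes interlinked, which is exactly the information-reuse goal described above.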
Web Portal and Intranet Search. The biggest problem with enterprise
intranets today is the huge difference in user experience when compared
to the Internet [18]. When using the Internet, the user is spoiled by mod-
ern technologies from, for example, Google or Facebook, which provide
very comfortable environments, precise search results, auto-complete text
boxes, etc. These technologies are made possible through large amounts of
resources invested in providing comfort for the millions of users, custom-
ers, beta testers, and by their large development team and also by the huge
number of documents available, which increases the chances that a user will
find what he is looking for. In contrast, in most enterprises, the intranet expe-
rience is often poor because the intranet uses technologies from the previ-
ous millennium. In order to implement search systems that are based on a
Linked Data approach and that provide a substantial benefit in comparison
with traditional search applications, the challenge of bootstrapping an initial

set of high-quality RDF data sources needs to be tackled first. For example, as a
prerequisite for linking documents to terms, a hierarchical taxonomy should
be created (see “Challenges in Data Integration for Large Enterprises”).
Mechanisms then need to be established to automatically create high-quality
links between documents and an initial set of terms (e.g., by crawling), since
it is not feasible to manually link the massive amount of available documents.
Furthermore, the process of semi-automatic linking of (a) terms that occur in
documents but are not part of the taxonomy yet (as well as their placement in
the taxonomy) and (b) terms that do not occur in documents but are related
and thus useful in a search needs to be investigated and suitable tools should
be developed to support responsible employees. To provide results beyond
those that can be obtained from text-based documents directly, other data
sets need to be transformed to RDF and queried. Finally, although a search
engine that queries RDF data directly works, it results in suboptimal per-
formance. The challenge here is to develop methods for improving perfor-
mance to match traditional search engines, while keeping the advantages of
using SPARQL directly. In an enterprise there exist at least two distinct areas
where search technology needs to be applied. On the one hand, there is cor-
porate internal search, which enables employees to find relevant information
required for their work. On the other hand, all large enterprises need at least
simple search capabilities on their public web portal(s), since otherwise the
huge amounts of information provided may not be reachable for potential
customers. Some dedicated companies (e.g., automotive companies) would
actually have a need for more sophisticated query capabilities, since the com-
plexity of offered products is very high. Nevertheless, in reality, search, both
internal and external, is often solely based on keyword matching. We argue
that by employing the Linked Data paradigm in enterprises the classical key-
word-based search can be enhanced. Additionally, more sophisticated search
mechanisms can be easily realized since more information is available in a
uniform and machine-processable format.
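The following sketch illustrates how taxonomy knowledge can enhance classical keyword search: a query term is expanded with its narrower terms, so that a search for "vehicle" also retrieves documents that mention only "truck". The toy taxonomy and documents are invented for illustration:

```python
# Sketch: enhancing keyword search with taxonomy knowledge. A query term
# is expanded with its narrower terms before matching. The taxonomy and
# documents are invented for illustration.

narrower = {"vehicle": ["car", "truck"]}  # stands in for skos:narrower relations

documents = {
    "doc1": "maintenance schedule for every truck in the fleet",
    "doc2": "marketing brochure for the new sedan",
}

def expanded_search(term, docs, taxonomy):
    """Match documents against the query term and all of its narrower terms."""
    terms = [term] + taxonomy.get(term, [])
    return sorted(doc_id for doc_id, text in docs.items()
                  if any(t in text for t in terms))

print(expanded_search("vehicle", documents, narrower))  # doc1 matched via "truck"
```

In a real deployment, the expansion terms would come from the enterprise taxonomy via SPARQL rather than from a hard-coded dictionary, but the principle of exploiting implicit knowledge during search is the same.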
Database Integration. Relational Database Management Systems (RDBMS)
are the predominant mode of data storage in the enterprise context. RDBMS
are used practically everywhere in the enterprise, serving, for example, in
computer-aided manufacturing, enterprise resource planning, supply chain
management, and content management systems. We, therefore, deem the
integration of relational data into Linked Data a crucial Enterprise Data
Integration technique. Primary concerns when integrating relational data are
scalability and query performance. With our R2RML-based tool SparqlMap,* we show that
an efficient query translation is possible, thus avoiding the higher deployment
costs associated with the data duplication inherent in ETL approaches. The
challenge of closing the gap between triple stores and relational databases is
also present in SPARQL-to-SQL mappers and drives research. A second chal-
lenge for mapping relational data into RDF is a current lack of best practices

* http://aksw.org/Projects/SparqlMap

and tool support for mapping creation. The standardization of the RDB to RDF
Mapping Language (R2RML) by the W3C RDB2RDF Working Group establishes
a common ground for an interoperable ecosystem of tools. However, there is
a lack of mature tools for the creation and application of R2RML mappings.
The challenge lies in the creation of user-friendly interfaces and in the estab-
lishment of best practices for creating, integrating, and maintaining those
mappings. Finally, for a read–write integration updates on the mapped data
need to be propagated back into the underlying RDBMS. An initial solution
is presented in [5]. In the context of enterprise data, an integration with granu-
lar access control mechanisms is of vital importance. Consequently, semantic
wikis, query federation tools, and interlinking tools can work with the data of
relational databases. The usage of SPARQL 1.1 query federation [26] allows
relational databases to be integrated into query federation systems with queries
spanning multiple databases. This federation enables, for example, portals
that, in combination with an EKB, provide an integrated view on enterprise
data.
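The core idea behind such mappings can be sketched as follows: each row of a relational table is turned into RDF triples according to a declarative mapping. The table, the URI template, and the simplified mapping structure are invented for illustration; real R2RML mappings are themselves expressed in RDF:

```python
# Sketch of the idea behind R2RML-style relational-to-RDF mapping: each row
# of a table becomes a subject URI (from a template) plus one triple per
# mapped column. Table, template, and mapping structure are invented for
# illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows column access by name
conn.execute("CREATE TABLE product (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])

mapping = {
    "table": "product",
    "subject_template": "http://example.corp/product/{id}",
    "predicate_object": [("rdfs:label", "name")],
}

def map_table(connection, m):
    """Turn each row of the mapped table into RDF triples."""
    triples = []
    for row in connection.execute(f"SELECT * FROM {m['table']}"):
        subject = m["subject_template"].format(**dict(row))
        for predicate, column in m["predicate_object"]:
            triples.append((subject, predicate, row[column]))
    return triples

print(map_table(conn, mapping))
```

A SPARQL-to-SQL mapper such as SparqlMap avoids materializing these triples by rewriting queries against the mapping instead, which sidesteps the data duplication of ETL approaches.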
Enterprise Single Sign-On. As a result of the large number of deployed soft-
ware applications in large enterprises, which are increasingly web-based,
single sign-on (SSO) solutions are of crucial importance. A Linked Data-based
approach aimed at tackling the SSO problem is WebID [30]. In order to deploy
a WebID-based SSO solution in large enterprises, a first challenge is to transfer
user identities to the Enterprise Data Web. Those Linked Data identities need to
be enriched and interlinked with further background knowledge, while main-
taining privacy. Thus, mechanisms need to be developed to assure that only
such information is publicly (i.e., public inside the corporation) available, that
is required for the authentication protocol. Another challenge that arises is
related to user management. With WebID a distributed management of identi-
ties is feasible (e.g., on department level), while those identities could still be
used throughout the company. Though this reduces the likeliness of a single
point of failure, it would require the introduction of mechanisms to ensure
that company-wide policies are enforced. Distributed group management and
authorization is already a research topic (e.g., dgFOAF [27]) in the area of social
networks. However, requirements that are gathered from distributed social
network use-cases differ from those captured from enterprise use-cases. Thus,
social network solutions need a critical inspection in the enterprise context.

Linked Data Paradigm for Integrating Enterprise Data


Addressing the challenges from the previous section leads to the creation
of a number of knowledge bases that populate a data intranet. Still, for this
intranet to abide by the vision of Linked Data while serving the purpose of
companies, we need to increase its coherence and establish links between
the data sets. Complex applications that rely on several sources of knowledge

usually integrate them into a unified view by means of the
extract-transform-load (ETL) paradigm [13]. For example, IBM’s DeepQA framework [8]
combines knowledge from DBpedia,* Freebase,† and several other knowledge
bases to determine the answer to questions with a speed superior to that of
human champions. A similar view to data integration can be taken within
the Linked Data paradigm with the main difference that the load step can
be discarded when the knowledge bases are not meant to be fused, which
is mostly the case. While the extraction was addressed above, the
transformation remains a complex challenge and has not yet been much
addressed in the enterprise context. The specification of such integration
processes for Linked Data is rendered tedious by several factors, including

1. A great number of knowledge bases (scalability), as well as
2. Schema mismatches and heterogeneous conventions for property
   values across knowledge bases (discrepancy)

Similar issues are found in the Linked Open Data (LOD) Cloud, which con-
sists of more than 30 billion triples‡ distributed across more than 250 knowl-
edge bases. In the following, we will use the Linked Open Data Cloud as
reference implementation of the Linked Data principles and present semi-
automatic means that aim to ensure high-quality Linked Data Integration.
The scalability of Linked Data Integration has been addressed in manifold
previous works on link discovery. Especially, Link Discovery frameworks such
as LIMES [21–23] as well as time-efficient algorithms such as PPJoin+ [34] have
been designed to address this challenge. Yet, none of these manifold approaches
provides theoretical guarantees with respect to their performance. Thus, so far,
it was impossible to predict how Link Discovery frameworks would perform
with respect to time or space requirements. Consequently, the deployment of
techniques such as customized memory management [2] or time-optimization
strategies [32] (e.g., automated scaling for cloud computing when provided with
very complex linking tasks) was rendered very demanding if not impossible.
A novel approach that addresses these drawbacks is the HR3 algorithm [20].
Similar to the HYPPO algorithm [22] (on whose formalism it is based), HR3
assumes that the property values that are to be compared are expressed in an
affine space with a Minkowski distance. Consequently, it can be most naturally
used to process the portion of link specifications that compare numeric values
(e.g., temperatures, elevations, populations, etc.). HR3 goes beyond the state of
the art by being able to carry out Link Discovery tasks with any achievable reduc-
tion ratio [6]. This theoretical guarantee is of practical importance, as it does
not only allow our approach to be more time-efficient than the state of the art

* http://dbpedia.org
† http://www.freebase.com
‡ http://www4.wiwiss.fu-berlin.de/lodcloud/state/
180 Big Data Computing

but also lays the foundation for the implementation of customized memory
management and time-optimization strategies for Link Discovery.
The difficulties behind the integration of Linked Data are not only caused
by the mere growth of the data sets in the Linked Data Web, but also by large
number of discrepancies across these data sets. In particular, ontology mis-
matches [7] affect mostly the extraction step of the ETL process. They occur
when different classes or properties are used in the source knowledge bases to
express equivalent knowledge (with respect to the extraction process at hand).
For example, while Sider* uses the class sider:side_effects to represent
diseases that can occur as a side effect of the intake of certain medication, the
more generic knowledge base DBpedia uses dbpedia:Disease. Such a mis-
match can lead to a knowledge base that integrates DBpedia and Sider contain-
ing duplicate classes. The same type of mismatch also occurs at the property
level. For example, while Eunis† uses the property eunis:binomialName
to represent the labels of species, DBpedia uses rdfs:label. Thus, even
if the extraction problem was resolved at class level, integrating Eunis and
DBpedia would still lead to the undesirable constellation of an integrated
knowledge base where instances of species would have two properties that
serve as labels. The second category of common mismatches mostly affects
the transformation step of ETL and lies in the different conventions used for
equivalent property values. For example, the labels of films in DBpedia differ
from the labels of films in LinkedMDB‡ in three ways: First, they contain a
language tag. Second, they carry the extension “(film)” if another entity with the same
label exists. Third, if another film with the same label exists, the production
year of the film is added. Consequently, the film Liberty from 1929 has the
label “Liberty (1929 film)@en” in DBpedia, while the same film bears the label
“Liberty” in LinkedMDB. A similar discrepancy in naming persons holds for
film directors (e.g., John Frankenheimer (DBpedia: John Frankenheimer@
en, LinkedMDB: John Frankenheimer (Director)) and John Ford (DBpedia:
John Ford@en, LinkedMDB: John Ford (Director))) and actors. Finding a
conform representation of the labels of movies that maps to the LinkedMDB
representation would require knowing the rules replace(“@en”, ε) and
replace(“(*film)”, ε), where ε stands for the empty string.
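To illustrate, the two replace rules above can be applied in a few lines of Python (a sketch of our own; the function name and the exact regular expression standing in for “(*film)” are assumptions, not part of the original text):

```python
import re

def normalize_film_label(label: str) -> str:
    """Apply the two transformation rules from the text:
    replace("@en", eps) and replace("(*film)", eps)."""
    label = label.replace("@en", "")                 # drop the language tag
    label = re.sub(r"\s*\(\d*\s*film\)", "", label)  # drop "(film)" / "(1929 film)"
    return label.strip()

print(normalize_film_label("Liberty (1929 film)@en"))  # -> Liberty
```

A full solution would, of course, learn such rules from examples rather than hard-code them, which is exactly what the CaRLA approach presented later in this chapter does.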

Runtime Complexity
The development of scalable algorithms for link discovery is of crucial
importance to address for the Big Data problems that enterprises are increas-
ingly faced with. While the variety of the data is addressed by the extraction

* http://sideeffects.embl.de/
† http://eunis.eea.europa.eu/
‡ http://linkedmdb.org/

processes presented in the sections above, the mere volume of the data makes
it necessary to have single linking tasks carried out as efficiently as possible.
Moreover, the velocity of the data requires that link discovery is carried out
on a regular basis. These requirements were the basis for the development
of HR 3 [20], the first reduction-ratio-optimal link discovery algorithm. In the
following, we present and evaluate this approach.

Preliminaries
In this section, we present the preliminaries necessary to understand the
subsequent parts of this section. In particular, we define the problem of
Link Discovery, the reduction ratio, and the relative reduction ratio for-
mally as well as give an overview of space tiling for Link Discovery. The
subsequent description of HR3 relies partly on the notation presented in
this section.
Link Discovery. The goal of Link Discovery is to compute the set of pairs of
instances (s, t) ∈ S × T that are related by a relation R, where S and T are two
not necessarily distinct sets of instances. One way of automating this discov-
ery is to compare s ∈ S and t ∈ T based on their properties using a distance
measure. Two entities are then considered to be linked via R if their distance
is less than or equal to a threshold θ [23].

Definition 1: Link Discovery on Distances

Given two sets S and T of instances, a distance measure δ over the properties of s ∈ S and t ∈ T, and a distance threshold θ ∈ [0, ∞[, the goal of Link Discovery is to compute the set M = {(s, t, δ(s, t)): s ∈ S ∧ t ∈ T ∧ δ(s, t) ≤ θ}.
Note that in this paper, we are only interested in lossless solutions, that is,
solutions that are able to find all pairs that abide by the definition given above.
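As a baseline, Definition 1 can be implemented naively by scanning the full Cartesian product S × T; the sketch below (toy data and function name are our own) makes the |S||T| comparison cost explicit:

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def brute_force_links(S, T, delta, theta):
    """Compute M = {(s, t, delta(s, t)): s in S, t in T, delta(s, t) <= theta}
    by exhaustively testing all |S| * |T| pairs."""
    return [(s, t, delta(s, t)) for s in S for t in T if delta(s, t) <= theta]

S = [(0.0, 0.0), (5.0, 5.0)]
T = [(0.5, 0.0), (9.0, 9.0)]
print(brute_force_links(S, T, dist, theta=1.0))
# -> [((0.0, 0.0), (0.5, 0.0), 0.5)]
```

This brute-force strategy is lossless but quadratic; the algorithms discussed next aim to avoid most of these comparisons without losing recall.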
Reduction Ratio. A brute-force approach to Link Discovery would execute
a Link Discovery task on S and T by carrying out |S||T| comparisons. One
of the key ideas behind time-efficient Link Discovery algorithms A is to
reduce the number of comparisons that are effectively carried out to a num-
ber C(A) < |S||T| [29]. The reduction ratio RR of an algorithm A is given by

RR(A) = 1 − C(A)/(|S||T|).  (5.1)

RR(A) captures how much of the Cartesian product |S||T| was not explored
before the output of A was reached. It is obvious that even an optimal loss-
less solution which performs only the necessary comparisons cannot achieve
an RR of 1. Let Cmin be the minimal number of comparisons necessary to
complete the Link Discovery task without losing recall, that is, Cmin = |M|.
We define the relative reduction ratio RRR(A) as the proportion of the

minimal number of comparisons that was carried out by the algorithm A


before it terminated. Formally,

RRR(A) = (1 − Cmin/(|S||T|)) / (1 − C(A)/(|S||T|)) = (|S||T| − Cmin)/(|S||T| − C(A)).  (5.2)

RRR(A) indicates how close A is to the optimal solution with respect to the
number of candidates it tests. Given that C(A) ≥ Cmin, RRR(A) ≥ 1. Note that
the larger the value of RRR(A), the poorer the performance of A with respect
to the task at hand.
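Equations (5.1) and (5.2) are straightforward to compute; the following sketch uses invented numbers purely for illustration:

```python
def reduction_ratio(c, size_s, size_t):
    """RR(A) = 1 - C(A) / (|S||T|), Equation (5.1)."""
    return 1 - c / (size_s * size_t)

def relative_reduction_ratio(c, c_min, size_s, size_t):
    """RRR(A) = (|S||T| - C_min) / (|S||T| - C(A)), Equation (5.2)."""
    return (size_s * size_t - c_min) / (size_s * size_t - c)

# Invented task: |S| = |T| = 1000, 5000 actual matches (C_min),
# and an algorithm that carried out 8000 comparisons.
print(reduction_ratio(8000, 1000, 1000))                 # 0.992
print(relative_reduction_ratio(8000, 5000, 1000, 1000))  # ~1.003, close to optimal
```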
The main observation that led to this work is that while most algorithms
aim to optimize their RR (and consequently their RRR), current approaches to
Link Discovery do not provide any guarantee with respect to the RR (and con-
sequently the RRR) that they can achieve. In this work, we present an approach
to Link Discovery in metric spaces whose RRR is guaranteed to converge to 1.
Space Tiling for Link Discovery. Our approach, HR3, builds upon the same
formalism on which the HYPPO algorithm relies, that is, space tiling. HYPPO
addresses the problem of efficiently mapping instance pairs (s, t) ∈ S × T
described by using exclusively numeric values in an n-dimensional metric
space and has been shown to outperform the state of the art in the previ-
ous work [22]. The observation behind space tiling is that in spaces (Ω, δ )
with orthogonal (i.e., uncorrelated) dimensions,* common metrics for
Link Discovery can be decomposed into the combination of functions
ϕi, i ∈ {1, . . ., n}, which operate on exactly one dimension of Ω: δ = f(ϕ1, . . ., ϕn). For
Minkowski distances of order p, ϕi(x, ω) = |xi − ωi| for all values of i and

δ(x, ω) = (Σ_{i=1}^{n} ϕi(x, ω)^p)^(1/p).
A direct consequence of this observation is the inequality ϕi(x, ω) ≤ δ(x, ω).
The basic insight behind this observation is that the hypersphere
H(ω, θ) = {x ∈ Ω: δ(x, ω) ≤ θ} is a subset of the hypercube V defined as
V(ω, θ) = {x ∈ Ω: ∀i ∈ {1, . . ., n}, ϕi(xi, ωi) ≤ θ}. Consequently, one can reduce the number of comparisons necessary to detect all elements of H(ω, θ) by discarding
all elements that are not in V(ω, θ) as nonmatches. Let Δ = θ/α, where α ∈ ℕ
is the granularity parameter that controls how fine-grained the space tiling
should be (see Figure 5.3 for an example). We first tile Ω into the adjacent
hypercubes (short: cubes) C that contain all the points ω such that

∀i ∈ {1, . . ., n}, ciΔ ≤ ωi < (ci + 1)Δ with (c1, . . ., cn) ∈ ℤ^n.  (5.3)


We call the vector (c1, . . ., cn) the coordinates of the cube C. Each point
ω ∈ Ω lies in the cube C(ω) with coordinates (⌊ωi/Δ⌋)i=1...n. Given such a space

* Note that in all cases, a space transformation exists that can map a space with correlated
dimensions to a space with uncorrelated dimensions.

Figure 5.3
Space tiling for different values of α. The colored squares show the set of elements that must be compared with the instance located at the black dot. The points within the circle lie within the distance θ of the black dot. Note that higher values of α lead to a better approximation of the hypersphere but also to more hypercubes.

tiling, it is obvious that V(ω, θ) consists of the union of the cubes such that
∀i ∈ {1, . . ., n}: |ci − c(ω)i| ≤ α.
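The tiling of Equation (5.3) and the hypercube approximation V(ω, θ) can be sketched as follows (a toy two-dimensional example; all names are our own):

```python
from itertools import product
from math import floor

def cube_coordinates(point, delta):
    """Coordinates (floor(w_i / delta)) of the cube containing a point, cf. Eq. (5.3)."""
    return tuple(floor(w / delta) for w in point)

def hypercube_approximation(omega, theta, alpha):
    """Cube coordinates forming V(omega, theta): all c with |c_i - c(omega)_i| <= alpha."""
    delta = theta / alpha
    center = cube_coordinates(omega, delta)
    return set(product(*[range(ci - alpha, ci + alpha + 1) for ci in center]))

cubes = hypercube_approximation(omega=(1.0, 1.0), theta=1.0, alpha=2)
print(len(cubes))  # (2 * alpha + 1)^n = 25 cubes in two dimensions
```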
Like most of the current algorithms for Link Discovery, space tiling does
not provide optimal performance guarantees. The main goal of this paper is
to build upon the tiling idea so as to develop an algorithm that can achieve
any possible RR. In the following, we present such an algorithm, HR 3.

The HR3 Algorithm
The goal of the HR3 algorithm is to efficiently map instance pairs (s, t) ∈ S × T
that are described by using exclusively numeric values in an n-dimensional
metric space where the distances are measured by using any Minkowski
distance of order p ≥ 2. To achieve this goal, HR3 relies on a novel indexing
scheme that allows achieving any RRR greater than or equal to 1. In
the following, we first present our new indexing scheme and show that we
can discard more hypercubes than simple space tiling for all granularities
α such that n(α − 1)^p > α^p. We then prove that by these means, our approach
can achieve any RRR greater than 1, therewith proving the optimality of our
indexing scheme with respect to RRR.

Indexing Scheme
Let ω ∈ Ω = S ∪ T be an arbitrary reference point. Furthermore, let δ be the
Minkowski distance of order p. We define the index function as follows:

index(C, ω) = 0 if ∃i ∈ {1, . . ., n}: |ci − c(ω)i| ≤ 1, and
index(C, ω) = Σ_{i=1}^{n} (|ci − c(ω)i| − 1)^p otherwise.  (5.4)

where C is a hypercube resulting from a space tiling and ω ∈ Ω. Figure 5.4
shows an example of such indices for p = 2 with α = 2 (Figure 5.4a) and α = 4
(Figure 5.4b).
Note that the highlighted square with index 0 contains the reference point
ω. Also note that our indexing scheme is symmetric with respect to C(ω).
Thus, it is sufficient to prove the subsequent lemmas for hypercubes C such
that ci > c(ω)i. In Figure 5.4, it is the upper right portion of the indexed space
with the gray background. Finally, note that the maximal index that a hypercube can achieve is n(α − 1)^p, as max |ci − c(ω)i| = α per construction of H(ω, θ).
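Equation (5.4) translates almost directly into code. The following sketch (our own) computes the index of a cube relative to C(ω):

```python
def index(c, c_omega, p):
    """index(C, omega) from Equation (5.4): 0 if some coordinate of C is within
    1 of C(omega)'s, else sum_i (|c_i - c(omega)_i| - 1)^p."""
    if any(abs(ci - wi) <= 1 for ci, wi in zip(c, c_omega)):
        return 0
    return sum((abs(ci - wi) - 1) ** p for ci, wi in zip(c, c_omega))

# Two-dimensional examples for p = 2:
print(index((1, 5), (0, 0), p=2))  # 0, since |1 - 0| <= 1 in the first dimension
print(index((2, 2), (0, 0), p=2))  # (2-1)^2 + (2-1)^2 = 2
print(index((4, 3), (0, 0), p=2))  # (4-1)^2 + (3-1)^2 = 13
```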
The indexing scheme proposed above guarantees the following:

Lemma 1

index(C, ω) = x → ∀s ∈ C(ω), ∀t ∈ C, δ^p(s, t) ≥ xΔ^p.

Proof
This lemma is a direct implication of the construction of the index.
Index(C,ω) = x implies that
Σ_{i=1}^{n} (|ci − c(ω)i| − 1)^p = x.

Now given the definition of the coordinates of a cube (Equation (5.3)), the
following holds:

∀s ∈ C(ω), ∀t ∈ C, |si − ti| ≥ (|ci − c(ω)i| − 1)Δ.


Consequently,
∀s ∈ C(ω), ∀t ∈ C, Σ_{i=1}^{n} |si − ti|^p ≥ Σ_{i=1}^{n} (|ci − c(ω)i| − 1)^p Δ^p.

By applying the definition of the Minkowski distance to the index function, we finally obtain ∀s ∈ C(ω), ∀t ∈ C, δ^p(s, t) ≥ xΔ^p.
Note that given that ω ∈ C(ω), the following also holds:

index(C, ω) = x → ∀t ∈ C: δ^p(ω, t) ≥ xΔ^p.  (5.5)


Approach
The main insight behind HR 3 is that in spaces with Minkowski distances,
the indexing scheme proposed above allows one to safely (i.e., without

Figure 5.4
Space tiling and resulting index for a two-dimensional example. Note that the index in both subfigures was generated for exactly the same portion of space. The black dot stands for the position of ω.

dismissing correct matches) discard more hypercubes than when using sim-
ple space tiling. More specifically,

Lemma 2

∀s ∈ S: index(C, s) > α^p implies that all t ∈ C are nonmatches.

Proof
This lemma follows directly from Lemma 1 as

index(C, s) > α^p → ∀t ∈ C, δ^p(s, t) > Δ^p α^p = θ^p.  (5.6)


For the purpose of illustration, let us consider the example of α = 4 and p = 2
in the two-dimensional case displayed in Figure 5.4b. Lemma 2 implies that
a hypercube C18 with index 18 cannot contain any element t such that δ(s, t) ≤ θ. While space tiling would discard all black cubes in
Figure 5.4b but include the elements of C18 as candidates, HR3 discards them
and still computes exactly the same results, yet with a better (i.e., smaller) RRR.
One of the direct consequences of Lemma 2 is that n(α − 1)^p > α^p is a necessary and sufficient condition for HR3 to achieve a better RRR than simple
space tiling. This is simply due to the fact that the largest index that can be
assigned to a hypercube is Σ_{i=1}^{n} (α − 1)^p = n(α − 1)^p. Now, if n(α − 1)^p > α^p, then
this cube can be discarded. For p = 2 and n = 2, for example, this condition is
satisfied for α ≥ 4. Knowing this inequality is of great importance when deciding on when to use HR3 as discussed in the “Evaluation” section.
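Lemma 2 and the inequality n(α − 1)^p > α^p can be illustrated with a small sketch (our own helper names) that contrasts the cubes kept by HR3 with those kept by plain space tiling for n = 2, p = 2, α = 4:

```python
from itertools import product

def index(c, c_omega, p):
    """index(C, omega) from Equation (5.4)."""
    if any(abs(ci - wi) <= 1 for ci, wi in zip(c, c_omega)):
        return 0
    return sum((abs(ci - wi) - 1) ** p for ci, wi in zip(c, c_omega))

n, p, alpha = 2, 2, 4
center = (0, 0)
neighborhood = list(product(range(-alpha, alpha + 1), repeat=n))

tiling_kept = len(neighborhood)  # plain space tiling keeps all (2*alpha + 1)^n cubes
hr3_kept = sum(1 for c in neighborhood
               if index(c, center, p) <= alpha ** p)  # Lemma 2 pruning

# Pruning is possible here because n*(alpha-1)^p = 18 > 16 = alpha^p:
print(tiling_kept, hr3_kept)  # 81 77 (the four corner cubes, index 18, are discarded)
```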
Let H(α, ω) = {C: index(C, ω) ≤ α^p}. H(α, ω) is the approximation of the
hypersphere H(ω, θ) = {ω′: δ(ω, ω′) ≤ θ} generated by HR3. We define the volume
of H(α, ω) as

V(H(α, ω)) = |H(α, ω)|Δ^n.  (5.7)

To show that given any r > 1, the approximation H(α, ω) can always achieve
an RRR(HR3) ≤ r, we need to show the following.

Lemma 3

lim_{α→∞} RRR(HR3, α) = 1.

Proof
The cubes that are not discarded by HR3(α) are those for which
Σ_{i=1}^{n} (|ci − ci(ω)| − 1)^p ≤ α^p. When α → ∞, Δ becomes infinitesimally small, leading to the cubes

being single points. Each cube C thus contains a single point x with coordinates xi = ciΔ. In particular, ci(ω)Δ = ωi. Consequently,

Σ_{i=1}^{n} (|ci − ci(ω)| − 1)^p ≤ α^p ↔ Σ_{i=1}^{n} ((|xi − ωi| − Δ)/Δ)^p ≤ α^p.  (5.8)

Given that θ = Δα, we obtain

Σ_{i=1}^{n} ((|xi − ωi| − Δ)/Δ)^p ≤ α^p ↔ Σ_{i=1}^{n} (|xi − ωi| − Δ)^p ≤ θ^p.  (5.9)

Finally, Δ → 0 when α → ∞ leads to

Σ_{i=1}^{n} (|xi − ωi| − Δ)^p ≤ θ^p ∧ α → ∞ → Σ_{i=1}^{n} |xi − ωi|^p ≤ θ^p.  (5.10)

This is exactly the condition for linking specified in Definition 1 applied
to Minkowski distances of order p. Consequently, H(∞, ω) is exactly H(ω, θ)
for any θ. Thus, the number of comparisons carried out by HR3(α) when α
→ ∞ is exactly Cmin, which leads to the conclusion lim_{α→∞} RRR(HR3, α) = 1.
Our conclusion is illustrated in Figure 5.5, which shows the approximations computed by HR3 for different values of α with p = 2 and n = 2. The
higher the α, the closer the approximation is to a circle. Note that these
results allow one to conclude that for any RRR-value r larger than 1, there
is a setting of HR3 that can compute links with an RRR smaller than or equal to r.

Evaluation
In this section, we present the data and hardware we used to evaluate our
approach. Thereafter, we present and discuss our results.

Experimental Setup
We carried out four experiments to compare HR3 with LIMES 0.5's HYPPO
and SILK 2.5.1. In the first and second experiments, we aimed to deduplicate
DBpedia places by comparing their names (rdfs:label), minimum elevation,
elevation, and maximum elevation. We retrieved 2988 entities that possessed
all four properties. We use the Euclidean metric on the last three values with
the thresholds 49 and 99 m for the first and second experiments, respectively.
The third and fourth experiments aimed to discover links between Geonames
and LinkedGeoData. Here, we compared the labels (rdfs:label), longitude,

Figure 5.5
Approximation generated by HR3 for different values of α. The white squares are selected, while the colored ones are discarded. (a) α = 4, (b) α = 8, (c) α = 10, (d) α = 25, (e) α = 50, and (f) α = 100.

and latitude of the instances. This experiment was of considerably larger scale
than the first one, as we compared 74,458 entities in Geonames with 50,031
entities from LinkedGeoData. Again, we measured the runtime necessary to
compare the numeric values when comparing them by using the Euclidean
metric. We set the distance thresholds to 1 and 9° in experiments 3 and 4,
respectively. We ran all experiments on the same Windows 7 Enterprise 64-bit
computer with a 2.8 GHz i7 processor with 8 GB RAM. The JVM was allocated
7 GB RAM to ensure that the runtimes were not influenced by swapping. Only
one of the kernels of the processors was used. Furthermore, we ran each of the
experiments three times and report the best runtimes in the following.

Results
We first measured the number of comparisons required by HYPPO and
HR3 to complete the tasks at hand (see Figure 5.6). Note that we could not
carry out this part of the evaluation for SILK 2.5.1, as it would have required
altering the code of the framework. In experiments 1, 3, and 4, HR3 can
reduce the overhead in comparisons (i.e., the number of unnecessary compar-
isons divided by the number of necessary comparisons) from approximately
24% for HYPPO to approximately 6% (granularity = 32). In experiment 2, the
overhead is reduced from 4.1 to 2%. This difference in overhead reduction
is mainly due to the data clustering around certain values and the clusters

having a radius between 49 and 99 m. Thus, running the algorithms with
a threshold of 99 m led to only a small a priori overhead, with HYPPO performing remarkably well. Still, even on such data distributions, HR3 was
able to discard even more data and to reduce the number of unnecessary
computations by more than 50% in relative terms. In the best case (experiment 4,
α = 32, see Figure 5.6d), HR3 required approximately 4.13 × 10^6 fewer comparisons than HYPPO. Even for the smallest setting (experiment 1, see
Figure 5.6a), HR3 still required 0.64 × 10^6 fewer comparisons.
We also measured the runtimes of SILK, HYPPO, and HR3. The best runtimes of the three algorithms for each of the tasks are reported in Figure 5.7.
Note that SILK’s runtimes were measured without the indexing time, as the
data fetching and indexing are merged into one process in SILK. Also note
that in the second experiment, SILK did not terminate due to higher memory
requirements. We approximated SILK's runtime by extrapolating the approximately 11 min it required for 8.6% of the computation before the RAM was
filled. Again, we did not consider the indexing time.
Because of the considerable difference in runtime (approximately two
orders of magnitude) between HYPPO and SILK, we report solely HYPPO
and HR3's runtimes in the detailed runtime Figures 5.8a and b. Overall, HR3
outperformed the other two approaches in all experiments, especially for
α = 4. It is important to note that the improvement in runtime increases with


the complexity of the experiment. For example, while HR3 outperforms
HYPPO by 3% in the second experiment, the difference grows to more than
7% in the fourth experiment. In addition, the improvement in runtime augments with the threshold. This can be seen in the third and fourth experiments. While HR3 is less than 2% faster in the third experiment, it is more
than 7% faster when θ = 4 in the fourth experiment. As expected, HR3 is
slower than HYPPO for α < 4 as it carries out exactly the same comparisons
but still has the overhead of computing the index. Yet, given that we know that
HR3 is only better when n(α − 1)^p > α^p, our implementation only carries out
the indexing when this inequality holds. By these means, we can ensure that
HR3 is only used when it is able to discard hypercubes that HYPPO would
not discard, therewith reaching superior runtimes both with small and large
values of α. Note that the difference between the improvement of the number
of comparisons necessitated by HR3 and the improvement in runtime over
all experiments is due to the supplementary indexing step required by HR3.
Finally, we measured the RRR of both HR3 and HYPPO (see Figures 5.8c
and d). In the two-dimensional experiments 3 and 4, HYPPO achieves an RRR
close to 1. Yet, it is still outperformed by HR3 as expected. A larger difference
between the RRR of HR3 and HYPPO can be seen in the three-dimensional
experiments, where the RRR of both algorithms diverge significantly. Note
that the RRR difference grows not only with the number of dimensions, but
also with the size of the problem. The difference in RRR between HYPPO
and HR3 does not always reflect the difference in runtime due to the indexing overhead of HR3. Still, for α = 4, HR3 generates a sufficient balance of
Figure 5.6
Number of comparisons for HR3 and HYPPO.

Figure 5.7
Comparison of the runtimes of HR3, HYPPO, and SILK 2.5.1.

indexing runtime and comparison runtime (i.e., RRR) to outperform HYPPO


in all experiments.

Discrepancy
In this section, we address the lack of coherence that comes about when
integrating data from several knowledge bases and using them within one
application. Here, we present CaRLA, the Canonical Representation Learning
Algorithm [19]. This approach addresses the discrepancy problem by learning canonical (also called conform) representations of data-type property
values. To achieve this goal, CaRLA implements a simple, time-efficient,
and accurate learning approach. We present two versions of CaRLA: a batch
learning and an active learning version. The batch learning approach relies
on a training data set to derive rules that can be used to generate conform
representations of property values. The active version of CaRLA (aCarLa)
extends CaRLA by computing unsure rules and retrieving highly informative candidates for annotation that allow one to validate or negate these
rules. One of the main advantages of CaRLA is that it can be configured
to learn transformations at character, n-gram, or even word level. By these
means, it can be used to improve integration and link discovery processes
based on string similarity/distance measures ranging from character-based
(edit distance) and n-gram-based (q-grams) to word-based (Jaccard similar-
ity) approaches.
Figure 5.8
Comparison of runtimes and RRR of HR3 and HYPPO. (a) Runtimes for experiments 1 and 2, (b) runtimes for experiments 3 and 4, (c) RRR for experiments 1 and 2, and (d) RRR for experiments 3 and 4.

Preliminaries
In the following, we define terms and notation necessary to formalize the
approach implemented by CaRLA. Let s ∈ Σ* be a string from an alphabet Σ.
We define a tokenization function as follows:

Definition 2: Tokenization Function

Given an alphabet A of tokens, a tokenization function token: Σ* → 2^A maps any string s ∈ Σ* to a subset of the token alphabet A.
Note that string similarity and distance measures rely on a large number
of different tokenization approaches. For example, the Levenshtein similar-
ity [15] relies on a tokenization at character level, while the Jaccard similarity
[11] relies on a tokenization at word level.

Definition 3: Transformation Rule

A transformation rule is a function r: A → A that maps a token from the alphabet A
to another token of A.
In the following, we will denote transform rules by using an arrow nota-
tion. For example, the mapping of the token “Alan” to “A.” will be denoted
by <“Alan” → “A.” >. For any rule r = <x → y > , we call x the premise and y the
consequence of r. We call a transformation rule trivial when it is of the form
<x → x> with x ∈ A. We call two transformation rules r and r′ inverse to each
other when r = <x → y > and r′ = <y → x>. Throughout this work, we will
assume that the characters that make up the tokens of A belong to Σ ∪ {ε},
where ε stands for the empty character. Note that we will consequently
denote deletions by rules of the form <x → ε > , where x ∈ A.

Definition 4: Weighted Transformation Rule

Let Γ be the set of all rules. Given a weight function w:Γ → ℝ , a weighted transfor-
mation rule is the pair (r,w(r)), where r ∈ Γ is a transformation rule.

Definition 5: Transformation Function

Given a set R of (weighted) transformation rules and a string s, we call the function
φR:Σ* → Σ* ∪ {ε} a transformation function when it maps s to a string φR(s) by
applying all rules ri ∈ R to every token of token(s) in an arbitrary order.
For example, the set R = {<“Alan” → “A.”>} of transformation rules would
lead to φR (“James Alan Hetfield”) = “James A. Hetfield”.
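Definitions 2–5 can be sketched as follows for a word-level tokenization (function names and the encoding of ε as the empty string are our own choices):

```python
def token(s):
    """Word-level tokenization function token: Sigma* -> 2^A."""
    return s.split()

def phi(s, rules):
    """Transformation function phi_R: apply each rule to every token of token(s).
    A deletion rule <x -> eps> is encoded by mapping x to the empty string."""
    transformed = [rules.get(t, t) for t in token(s)]
    return " ".join(t for t in transformed if t != "")

R = {"Alan": "A."}
print(phi("James Alan Hetfield", R))            # James A. Hetfield
print(phi("Thomas T. van Nguyen", {"T.": ""}))  # Thomas van Nguyen
```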

CaRLA
The goal of CaRLA is two-fold: First, it aims to compute rules that allow one
to derive conform representations of property values. As entities can have
several values for the same property, CaRLA also aims to detect a condition
under which two property values should be merged during the integration
process. In the following, we will assume that two source knowledge bases
are to be integrated to one. Note that our approach can be used for any num-
ber of source knowledge bases.
Formally, CaRLA addresses the problem of finding the required transfor-
mation rules by computing an equivalence relation e between pairs of prop-
erty values (p1, p2), that is, such that e(p1, p2) holds when p1 and p2 should be
mapped to the same canonical representation p. CaRLA computes e by generating two sets of weighted transformation rules R1 and R2 such that,
for a given similarity function σ, e(p1, p2) → σ(φR1(p1), φR2(p2)) ≥ θ, where θ is
a similarity threshold. The canonical representation p is then set to φR1(p1).
The similarity condition σ(φR1(p1), φR2(p2)) ≥ θ is used to distinguish
between the pairs of property values that should be merged.
To detect R1 and R2, CaRLA assumes two training data sets P and N,
of which N can be empty. The set P of positive training examples is com-
posed of pairs of property value pairs (p1, p2) such that e(p1, p2) holds. The
set N of negative training examples consists of pairs (p1,p2) such that e(p1, p2)
does not hold. In addition, CaRLA assumes being given a similarity func-
tion σ and a corresponding tokenization function token. Given this input,
CaRLA implements a simple three-step approach: It begins by computing
the two sets R1 and R 2 of plausible transformation rules based on the posi-
tive examples at hand (Step 1). Then it merges inverse rules across R1 and R2
and discards rules with a low weight during the rule merging and filtering
step. From the resulting set of rules, CaRLA derives the similarity condition
e(p1, p2) → σ(φR1(p1), φR2(p2)) ≥ θ. It then applies these rules to the negative
examples in N and tests whether the similarity condition also holds for the
negative examples. If this is the case, then it discards rules until it reaches a
local minimum of its error function. The retrieved set of rules and the novel
value of θ constitute the output of CaRLA and can be used to generate the
canonical representation of the properties in the source knowledge bases.
In the following, we explain each of the three steps in more detail.
Throughout the explanation, we use the toy example shown in Table 5.2. In
addition, we will assume a word-level tokenization function and the Jaccard
similarity.

Rule Generation
The goal of the rule generation set is to compute two sets of rules R1 and R2
that will underlie the transformation ϕ R1 and ϕ R2, respectively. We begin by
tokenizing all positive property values pi and pj such that (pi, pj) ∈ P. We call T1

Table 5.2
​Toy Example Data Set
Type Property Value 1 Property Value 2
⊕ “Jean van Damne” “Jean Van Damne (actor)”
⊕ “Thomas T. van Nguyen” “Thomas Van Nguyen (actor)”
⊕ “Alain Delon” “Alain Delon (actor)”
⊕ “Alain Delon Jr.” “Alain Delon Jr. (actor)”
⊖ “Claude T. Francois” “Claude Francois (actor)”
Note: The positive examples are of type ⊕ and the negative of type ⊖.

the set of all tokens pi such that (pi, pj) ∈ P, while T2 stands for the set of all pj. We
begin the computation of R1 by extending the set of tokens of each pj ∈ T2 by
adding ε to it. Thereafter, we compute the following rule score function score:

score(<x → y>) = |{(pi, pj) ∈ P: x ∈ token(pi) ∧ y ∈ token(pj)}|.  (5.11)

score computes the number of co-occurrences of the tokens x and y across P.


All tokens x ∈ T1 always have a maximal co-occurrence with ε, as ε occurs
in all token sets of T2. To ensure that we do not compute only deletions, we
decrease the score of rules <x → ε > by a factor κ ∈ [0, 1]. Moreover, in the
case of a tie, we assume the rule <x → y> to be more natural than <x → y′> if
σ(x, y) > σ(x, y′). Given that σ is bound between 0 and 1, it is sufficient to add a
fraction of σ(x, y) to each rule <x → y> to ensure that the better rule is chosen.
Our final score function is thus given by

score_final(<x → y>) = score(<x → y>) + σ(x, y)/2 if y ≠ ε, and
score_final(<x → y>) = κ × score(<x → y>) otherwise.  (5.12)

Finally, for each token x ∈ T1, we add the rule r = <x → y> to R1 iff x ≠ y
(i.e., r is not trivial) and y = argmax_{y′ ∈ T2} score_final(<x → y′>). To compute R2,
we simply swap T1 and T2, invert P (i.e., compute the set {(pj, pi) : (pi, pj) ∈ P}),
and run through the procedure described above.
For the set P in our example, we obtain the following sets of rules:
R1 = {(<“van” → “Van”>, 2.08), (<“T.” → ε >, 2)} and R2 = {(<“Van” → “van”>,
2.08), (<“(actor)” → ε >, 2)}.
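The rule generation step can be sketched in a few lines of Python. Everything here is illustrative: the token-level similarity σ used for tie-breaking is taken to be Jaccard over character sets (the text does not fix a particular σ for this purpose, so the resulting scores differ slightly from the 2.08 reported above), and ε is encoded as the empty string.

```python
from collections import defaultdict

EPS = ""  # encodes the empty token ε

def char_jaccard(x, y):
    # token-level similarity σ used only for tie-breaking (an assumption)
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if a | b else 1.0

def generate_rules(P, kappa=0.8):
    """Compute the candidate rule set R1 from the positive pairs P.
    Returns {premise: (conclusion, final_score)} following Eqs. 5.11-5.12."""
    score = defaultdict(int)
    T1 = set()
    for p1, p2 in P:
        tokens1 = set(p1.split())            # word-level tokenization
        tokens2 = set(p2.split()) | {EPS}    # extend the token set with ε
        T1 |= tokens1
        for x in tokens1:
            for y in tokens2:
                score[(x, y)] += 1           # co-occurrence count (Eq. 5.11)
    rules = {}
    for x in T1:
        best_y, best_s = None, float("-inf")
        for (xx, y), s in score.items():
            if xx != x:
                continue
            # Eq. 5.12: dampen deletions by κ, break ties by σ(x, y)/2
            s_final = kappa * s if y == EPS else s + char_jaccard(x, y) / 2
            if s_final > best_s:
                best_y, best_s = y, s_final
        if best_y != x:                      # keep only non-trivial rules
            rules[x] = (best_y, best_s)
    return rules

P = [("Jean van Damne", "Jean Van Damne (actor)"),
     ("Thomas T. van Nguyen", "Thomas Van Nguyen (actor)"),
     ("Alain Delon", "Alain Delon (actor)"),
     ("Alain Delon Jr.", "Alain Delon Jr. (actor)")]
R1_candidates = generate_rules(P)
```

On the positive examples of Table 5.2 this yields, among others, the rule <“van” → “Van”>; depending on the choice of σ, the rule for “T.” can resolve to ε or to another token, which is exactly the kind of unsure rule the subsequent filtering and falsification steps deal with.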

Rule Merging and Filtering


The computation of R1 and R2 can lead to a large number of inverse or improbable
rules. In our example, R1 contains the rule <“van” → “Van”> while R2
contains <“Van” → “van”>. Applying these rules to the data would consequently
not improve the convergence of their representations. To ensure that
the transformation rules lead to similar canonical forms, the rule merging step
first discards all rules <x → y> ∈ R2 such that <y → x> ∈ R1 (i.e., rules in R2
that are inverse to rules in R1). Then, low-weight rules are discarded. The idea
here is that if there is not enough evidence for a rule, it might just be a random
event. The initial similarity threshold θ for the similarity condition is finally
set to

θ = min_{(p1, p2) ∈ P} σ(ϕR1(p1), ϕR2(p2)).  (5.13)

In our example, CaRLA would discard <“Van” → “van”> from R2. When
assuming a score threshold of 10% of P’s size (i.e., 0.4), no rule would be filtered
out. The output of this step would consequently be R1 = {(<“van” → “Van”>,
2.08), (<“T.” → ε>, 2)} and R2 = {(<“(actor)” → ε>, 2)}.
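A hedged sketch of this merging and filtering step follows. Rule sets are modeled as dicts mapping a premise token to a (conclusion, score) pair, with the empty string standing for ε; the candidate rules and the 0.4 score threshold are taken from the running example.

```python
def apply_rules(value, rules):
    # ϕ_R: rewrite each token by its rule, if any; ε-rules ("") delete tokens
    out = []
    for t in value.split():
        y = rules[t][0] if t in rules else t
        if y != "":
            out.append(y)
    return " ".join(out)

def jaccard(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 1.0

def merge_and_filter(R1, R2, P, min_score):
    """Discard rules of R2 that invert a rule of R1, drop low-weight rules,
    and derive the initial similarity threshold θ (Eq. 5.13)."""
    R2 = {x: (y, s) for x, (y, s) in R2.items()
          if not (y in R1 and R1[y][0] == x)}      # inverse-rule check
    R1 = {x: ys for x, ys in R1.items() if ys[1] >= min_score}
    R2 = {x: ys for x, ys in R2.items() if ys[1] >= min_score}
    theta = min(jaccard(apply_rules(p1, R1), apply_rules(p2, R2))
                for p1, p2 in P)
    return R1, R2, theta

P = [("Jean van Damne", "Jean Van Damne (actor)"),
     ("Thomas T. van Nguyen", "Thomas Van Nguyen (actor)"),
     ("Alain Delon", "Alain Delon (actor)"),
     ("Alain Delon Jr.", "Alain Delon Jr. (actor)")]
R1 = {"van": ("Van", 2.08), "T.": ("", 2.0)}
R2 = {"Van": ("van", 2.08), "(actor)": ("", 2.0)}
R1, R2, theta = merge_and_filter(R1, R2, P, min_score=0.4)
```

Here <“Van” → “van”> is dropped as the inverse of <“van” → “Van”>; with all remaining rules applied, every positive pair becomes identical, so the initial θ is 1.0.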

Rule Falsification
The aim of the rule falsification step is to detect a set of transformations that
leads to a minimal number of elements of N having a similarity superior to θ
via σ. To achieve this goal, we follow a greedy approach that aims to minimize
the magnitude of the set


E = {(p1, p2) ∈ N : σ(ϕR1(p1), ϕR2(p2)) ≥ θ = min_{(p1, p2) ∈ P} σ(ϕR1(p1), ϕR2(p2))}.  (5.14)

Our approach simply tries to discard all rules that apply to elements of E,
in ascending order of their scores. If E is empty, then the approach terminates.
If E does not get smaller, then the change is rolled back and the next rule is tried.
Else, the rule is discarded from the set of final rules. Note that discarding a
rule can alter the value of θ and thus E. Once the set E has been computed,
CaRLA concludes its computation by generating a final value of the threshold θ.
In our example, two rules apply to the element of N. After discarding the
rule <“T.” → ε>, the set E becomes empty, leading to the termination of the
rule falsification step. The final sets of rules are thus R1 = {<“van” → “Van”>}
and R2 = {<“(actor)” → ε>}. The value of θ is computed to be 0.75. Table 5.3
shows the canonical property values for our toy example. Note that this
threshold makes it possible to reject the elements of N, that is, they are no
longer considered equivalent property values.
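The greedy falsification loop can be sketched as follows. This is again a simplification: rule sets are plain dicts from token to replacement, "" stands for ε, and `scores` maps a (rule-set index, premise) pair to the rule's score; rules that do not apply to any erroneous pair are simply tried and rolled back.

```python
def phi(value, rules):
    # apply transformation rules token-wise; "" stands for ε (deletion)
    return " ".join(rules.get(t, t) for t in value.split()
                    if rules.get(t, t) != "")

def jacc(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 1.0

def falsify(R1, R2, P, N, scores):
    """Discard rules by ascending score until the error set E (Eq. 5.14)
    is empty, rolling back removals that do not shrink E."""
    R1, R2 = dict(R1), dict(R2)
    def theta():
        return min(jacc(phi(p1, R1), phi(p2, R2)) for p1, p2 in P)
    def error_set():
        t = theta()
        return [(n1, n2) for n1, n2 in N
                if jacc(phi(n1, R1), phi(n2, R2)) >= t]
    for side, x in sorted(scores, key=scores.get):
        before = error_set()
        if not before:
            break                      # E is empty: done
        R = R1 if side == 1 else R2
        if x not in R:
            continue
        removed = R.pop(x)             # tentatively discard the rule
        if len(error_set()) >= len(before):
            R[x] = removed             # no improvement: roll back
    return R1, R2, theta()

P = [("Jean van Damne", "Jean Van Damne (actor)"),
     ("Thomas T. van Nguyen", "Thomas Van Nguyen (actor)"),
     ("Alain Delon", "Alain Delon (actor)"),
     ("Alain Delon Jr.", "Alain Delon Jr. (actor)")]
N = [("Claude T. Francois", "Claude Francois (actor)")]
scores = {(1, "van"): 2.08, (1, "T."): 2.0, (2, "(actor)"): 2.0}
R1, R2, theta = falsify({"van": "Van", "T.": ""}, {"(actor)": ""}, P, N, scores)
```

On the toy data, dropping <“T.” → ε> empties E and the final θ becomes 0.75, matching the values given in the text.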
It is noteworthy that by learning transformation rules, we also found an
initial threshold θ for determining the similarity of property values using
σ as similarity function. In combination with the canonical forms computed
by CaRLA, the configuration (σ, θ) can be used as an initial configuration
for Link Discovery frameworks such as LIMES. For example, the
Table 5.3
Canonical Property Values for Our Example Data Set
Property Value 1 Property Value 2 Canonical Value
“Jean van Damne” “Jean Van Damne (actor)” “Jean Van Damne”
“Thomas T. van Nguyen” “Thomas Van Nguyen (actor)” “Thomas T. Van Nguyen”
“Alain Delon” “Alain Delon (actor)” “Alain Delon”
“Alain Delon Jr.” “Alain Delon Jr. (actor)” “Alain Delon Jr.”
“Claude T. Francois” “Claude T. Francois”
“Claude Francois (actor)” “Claude Francois”

smallest Jaccard similarity for the pairs of property values in our example
is 1/3, leading to a precision of 0.71 for a recall of 1 (F-measure:
0.83). Yet, after the computation of the transformation rules, we reach an
F-measure of 1 with a threshold of 1. Consequently, the pair (σ, θ) can
be used for determining an initial classifier for approaches such as the
RAVEN algorithm [24].

Extension to Active Learning


One of the drawbacks of batch learning approaches is that they often require
a large number of examples to generate good models. As our evaluation
shows (see the “Evaluation” section), this drawback also holds for the batch
version of CaRLA, as it can easily detect very common rules but sometimes
fails to detect rules that apply to fewer pairs of property values. In the following,
we present how this problem can be addressed by extending CaRLA to
aCaRLA using active learning [28].
The basic idea here is to begin with small training sets P0 and N0. In each
iteration, all the available training data are used by the batch version of
CaRLA to update the set of rules. The algorithm then tries to refute or validate
rules with a score below the score threshold smin (i.e., unsure rules). For
this purpose, it picks the most unsure rule r that has not been shown to be
erroneous in a previous iteration (i.e., that is not an element of the set of
banned rules B). It then fetches a set Ex of property values that match the left
side (i.e., the premise) of r. Should there be no unsure rule, then Ex is set to
the q property values that are most dissimilar to the already known property
values. Annotations consisting of the corresponding values for the elements
of Ex in the other source knowledge bases are requested from the user
and written into the set P. Property values with no corresponding values are
written into N. Finally, the sets of positive and negative examples are updated
and the triple (R1, R2, θ) is learned anew until a stopping condition, such as
a maximal number of questions, is reached. As our evaluation shows, this
simple extension of the CaRLA algorithm allows it to efficiently detect the
pairs of annotations that might lead to a larger set of high-quality rules.
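The control flow of this loop can be sketched as follows. The three callables are placeholders for components the text only describes (a real implementation would plug in the batch learner from the previous sections and a user-interaction layer); as a simplification, values for which the user finds no counterpart are skipped here, whereas aCaRLA would add them to N.

```python
def acarla(P0, N0, batch_carla, fetch_examples, annotate,
           s_min=1.0, max_iterations=5):
    """Sketch of the aCaRLA loop.
    batch_carla(P, N)  -> (R1, R2, theta, scores)   # scores: rule -> score
    fetch_examples(r)  -> property values matching rule r's premise
                          (r is None when no unsure rule is left)
    annotate(value)    -> corresponding value in the other knowledge base,
                          or None (this plays the role of the user)."""
    P, N = list(P0), list(N0)
    banned = set()                       # rules already examined
    for _ in range(max_iterations):
        R1, R2, theta, scores = batch_carla(P, N)
        unsure = [r for r in scores
                  if scores[r] < s_min and r not in banned]
        r = min(unsure, key=scores.get) if unsure else None
        for value in fetch_examples(r):
            match = annotate(value)      # annotation request to the user
            if match is not None:
                P.append((value, match))
        if r is not None:
            banned.add(r)
    return P, N, batch_carla(P, N)

# Tiny scripted demo standing in for the learner and the user:
asked = []
def demo_batch(P, N):
    return ({}, {}, 1.0, {"r_low": 0.5, "r_high": 2.0})
def demo_fetch(r):
    asked.append(r)
    return ["x"] if r == "r_low" else []
P_out, N_out, model = acarla([], [], demo_batch, demo_fetch,
                             lambda v: v + "_match", max_iterations=2)
```

In the demo, the single unsure rule "r_low" triggers one annotation request, after which no unsure rule remains and the loop falls back to asking for maximally dissimilar values (empty here).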
Evaluation
Experimental Setup
In the experiments reported in this section, we evaluated CaRLA by two
means: First, we aimed to measure how well CaRLA could compute transformations
created by experts. To achieve this goal, we retrieved transformation
rules from four link specifications defined manually by experts within
the LATC project.* An overview of these specifications is given in Table 5.4.
Each link specification aimed to compute owl:sameAs links between entities
across two knowledge bases by first transforming their property values
and by then computing the similarity of the entities based on the similarity
of their property values. For example, the computation of links between
films in DBpedia and LinkedMDB was carried out by first applying the rule set
R1 = {<(film) → ε>} to the labels of films in DBpedia and R2 = {<(director) → ε>}
to the labels of their directors. We ran both CaRLA and aCaRLA on the property
values of the interlinked entities and measured how fast CaRLA was
able to reconstruct the set of rules that were used during the Link Discovery
process.
In addition, we quantified the quality of the rules learned by
CaRLA. In each experiment, we computed the boost in the precision
of the mapping of property pairs with and without the rules derived
by CaRLA. The initial precision was computed as |P|/|M|, where
M = {(pi, pj) : σ(pi, pj) ≥ min_{(p1, p2) ∈ P} σ(p1, p2)}. The precision after applying
CaRLA’s results was computed as |P|/|M′|, where M′ = {(pi, pj) :
σ(ϕR1(pi), ϕR2(pj)) ≥ min_{(p1, p2) ∈ P} σ(ϕR1(p1), ϕR2(p2))}. Note that in both cases,
the recall was 1, given that ∀(pi, pj) ∈ P : σ(pi, pj) ≥ min_{(p1, p2) ∈ P} σ(p1, p2). In all
experiments, we used the Jaccard similarity metric and a word tokenizer
with κ = 0.8. All runs were carried out on a notebook running Windows 7
Enterprise with 3 GB RAM and an Intel Dual Core 2.2 GHz processor. Each
of the algorithms was run five times. We report the rules that were discovered
by the algorithms and the number of experiments within which they
were found.
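Instantiated on the toy example of Table 5.2, the two precision figures can be computed as follows (taking P ∪ N as the candidate pairs, which the text leaves implicit, and writing ε-deletions as empty strings):

```python
def jaccard(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 1.0

def phi(value, rules):
    # apply token rewrite rules; "" stands for ε (token deletion)
    return " ".join(rules.get(t, t) for t in value.split()
                    if rules.get(t, t) != "")

def precision(P, candidates, R1=None, R2=None):
    """|P| / |M| with M the candidate pairs whose (transformed) similarity
    reaches the minimum similarity over the positive pairs P."""
    R1, R2 = R1 or {}, R2 or {}
    theta = min(jaccard(phi(p1, R1), phi(p2, R2)) for p1, p2 in P)
    M = [(a, b) for a, b in candidates
         if jaccard(phi(a, R1), phi(b, R2)) >= theta]
    return len(P) / len(M)

P = [("Jean van Damne", "Jean Van Damne (actor)"),
     ("Thomas T. van Nguyen", "Thomas Van Nguyen (actor)"),
     ("Alain Delon", "Alain Delon (actor)"),
     ("Alain Delon Jr.", "Alain Delon Jr. (actor)")]
N = [("Claude T. Francois", "Claude Francois (actor)")]

before = precision(P, P + N)
after = precision(P, P + N, R1={"van": "Van"}, R2={"(actor)": ""})
```

Before any transformation, the negative pair clears the minimal positive similarity (1/3 on the toy data), so precision is 4/5; after applying the learned rules it rises to 1 at recall 1.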

Table 5.4
Overview of the Data Sets
Experiment Source Target Source Property Target Property Size
Actors DBpedia LinkedMDB rdfs:label rdfs:label 1172
Directors DBpedia LinkedMDB rdfs:label rdfs:label 7353
Movies DBpedia LinkedMDB rdfs:label rdfs:label 9859
Producers DBpedia LinkedMDB rdfs:label rdfs:label 1540

* http://latc-project.eu
Results and Discussion


Table 5.5 shows the union of the rules learned by the batch version of CaRLA
in all five runs. Note that the computation of a rule set lasted under 0.5 s
even for the largest data set, that is, Movies. The columns Pn give the probability
of finding a rule for a training set of size n in our experiments. R2
is not reported because it remained empty in all setups. Our results show
that in all cases, CaRLA converges quickly and learns rules that are equivalent
to those utilized by the LATC experts with a sample set of 5 pairs. Note
that for each rule of the form <“@en” → y> with y ≠ ε that we learned, the
experts used the rule <y → ε> while the linking platform automatically
removed the language tag. We experimented with the same data sets without
language tags and computed exactly the same rules as those devised by
the experts. In some experiments (such as Directors), CaRLA was even able to
detect rules that were not included in the set of rules generated by human
experts. For example, the rule <“(filmmaker)” → “(director)”> is not very
frequent and was thus overlooked by the experts. In Table 5.5, we marked
such rules with an asterisk. The Directors and Movies data sets contained
a large number of typographic errors of different sorts (incl. misplaced
hyphens, character repetitions such as in the token “Neilll”, etc.), which led
to poor precision scores in our experiments. We cleaned the first 250 entries
of these data sets from these errors and obtained the results in the rows
labeled Directors_clean and Movies_clean. The results of CaRLA on these
data sets are also shown in Table 5.5. We also measured the improvement
in precision that resulted from applying CaRLA to the data sets at hand
(see Figure 5.9). The precision remained constant across the different
data set sizes. In the best case (cleaned Directors data set), we are able
to improve the precision of the property mapping by 12.16%. Note that we

Table 5.5
Overview of Batch Learning Results
Experiment R1 P5 P10 P20 P50 P100
Actors <“@en” → “(actor)”> 1 1 1 1 1
Directors <“@en” → “(director)”> 1 1 1 1 1
<“(filmmaker)” → “(director)”>* 0 0 0 0 0.2
Directors_clean <“@en” → “(director)”> 1 1 1 1 1
Movies <“@en” → ε > 1 1 1 1 1
<“(film)” → ε > 1 1 1 1 1
<“film)” → ε >* 0 0 0 0 0.6
Movies_clean <“@en” → ε > 1 1 1 1 1
<“(film)” → ε > 0 0.8 1 1 1
<“film)” → ε >* 0 0 0 0 1
Producers <“@en” → “(producer)”> 1 1 1 1 1
Figure 5.9
Comparison of the precision and thresholds with and without CaRLA. (a) Comparison of the
precision with and without CaRLA. (b) Comparison of the thresholds with and without CaRLA.

can improve the precision of the mapping of property values even on the
noisy data sets.
Interestingly, when used on the Movies data set with a training data
set size of 100, our framework learned low-confidence rules such as
<“(1999” → ε>, which were nevertheless discarded due to their low scores. These are
the cases where aCaRLA displays its superiority. Thanks to its ability to
ask for annotations when faced with unsure rules, aCaRLA is able to validate
or refute unsure rules. As the results on the Movies example show,
Table 5.6
Overview of Active Learning Results
Experiment R1 P5 P10 P20 P50 P100
Actors <“@en” → “(actor)”> 1 1 1 1 1
Directors <“@en” → “(director)”> 1 1 1 1 1
<“(actor)” → “(director)”>* 0 0 0 0 1
Directors_clean <“@en” → “(director)”> 1 1 1 1 1
Movies <“@en” → ε > 1 1 1 1 1
<“(film)” → ε > 1 1 1 1 1
<“film)” → ε >* 0 0 0 0 1
<“(2006” → ε >* 0 0 0 0 1
<“(199” → ε >* 0 0 0 0 1
Movies_clean <“@en” → ε > 1 1 1 1 1
<“(film)” → ε > 0 1 1 1 1
<“film)” → ε >* 0 0 0 0 1
Producers <“@en” → “(producer)”> 1 1 1 1 1

aCaRLA is able to detect several supplementary rules that were overlooked
by human experts. Especially, it clearly shows that deleting the year of creation
of a movie can improve the conformation process. aCaRLA is also able
to generate a significantly larger number of candidate rules for the user’s
convenience. For example, it detects a large set of low-confidence rules
such as <“(actress)” → “(director)”>, <“(actor)” → “(director)”>, and <“(actor/
director)” → “(director)”> on the Directors data set. Note that in one case
aCaRLA misses the rule <“(filmmaker)” → “(director)”> that is discovered
by CaRLA with a low probability. This is due to the active learning process
being less random. The results achieved by aCaRLA on the same data
sets are shown in Table 5.6. Note that the runtime of aCaRLA lay between
50 ms per iteration (cleaned data sets) and 30 s per iteration (largest data set,
Movies). The most time-expensive operation was the search for the property
values that were least similar to the already known ones.

Conclusion
In this chapter, we introduced a number of challenges arising in the context
of Linked Data in Enterprise Integration. A crucial prerequisite for addressing
these challenges is to establish efficient and effective link discovery and
data integration techniques, which scale to the large-scale data scenarios found
in the enterprise. We addressed the transformation and linking steps of
Linked Data Integration by presenting two algorithms, HR3 and CaRLA. We
proved that HR3 is optimal with respect to its reduction ratio by showing
that its RRR converges toward 1 when α converges toward ∞. HR3 is intended as
the first of a novel type of Link Discovery approaches, that is, approaches that
can guarantee theoretical optimality while also being empirically usable. In
future work, more such approaches will enable superior memory and
space management. CaRLA uses batch and active learning approaches to discover
a large number of transformation rules efficiently and was shown to
increase the precision of property mappings by up to 12% when the recall is
set to 1. In addition, CaRLA was shown to be able to detect rules that escaped
experts while they devised specifications for link discovery.

References
1. S. Auer, J. Lehmann, and S. Hellmann. LinkedGeoData: Adding a spatial
dimension to the Web of Data. The Semantic Web-ISWC 2009, pp. 731–746,
2009.
2. F. C. Botelho and N. Ziviani. External perfect hashing for very large key sets. In
CIKM, pp. 653–662, 2007.
3. T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible
Markup Language (XML) 1.0 (Fifth Edition). W3C, 2008.
4. J. Clark and M. Makoto. RELAX NG Specification. Oasis, December 2001. http://www.oasis-open.org/committees/relax-ng/spec-20011203.html.
5. V. Eisenberg and Y. Kanza. D2RQ/update: Updating relational data via virtual
RDF. In WWW (Companion Volume), pp. 497–498, 2012.
6. M. G. Elfeky, A. K. Elmagarmid, and V. S. Verykios. Tailor: A record linkage tool
box. In ICDE, pp. 17–28, 2002.
7. J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg,
2007.
8. D. A. Ferrucci, E. W. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur,
A. Lally et al. Building Watson: An overview of the deepQA project. AI Magazine,
31(3):59–79, 2010.
9. A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In
Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB’06),
pp. 9–16. VLDB Endowment, 2006.
10. A. Hogan, A. Harth, and A. Polleres. Scalable authoritative OWL reasoning for
the web. International Journal on Semantic Web and Information Systems (IJSWIS),
5(2):49–90, 2009.
11. P. Jaccard. Étude comparative de la distribution florale dans une portion des
Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–
579, 1901.
12. R. Jelliffe. The Schematron—An XML Structure Validation Language using Patterns
in Trees. ISO/IEC 19757, 2001.
13. R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for
Extracting, Cleaning, Conforming, and Delivering Data. Wiley, Hoboken, NJ, 2004.
14. M. Krötzsch, D. Vrandečić, and M. Völkel. Semantic Media Wiki. The Semantic
Web-ISWC 2006, pp. 935–942, 2006.
15. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and
reversals. Technical Report 8, 1966.
16. A. Majchrzak, C. Wagner, and D. Yates. Corporate wiki users: Results of a
survey. In WikiSym’06: Proceedings of the 2006 International Symposium on Wikis,
Odense, Denmark. ACM, August 2006.
17. A. Miles and S. Bechhofer. SKOS Simple Knowledge Organization System
Reference. W3C Recommendation, 2008. http://www.w3.org/TR/skos-reference/.
18. R. Mukherjee and J. Mao. Enterprise search: Tough stuff. Queue, 2(2):36, 2004.
19. A.-C. Ngonga Ngomo. Learning conformation rules for linked data integration.
In International Workshop on Ontology Matching, Boston, USA, 2012.
20. A.-C. Ngonga Ngomo. Link discovery with guaranteed reduction ratio in affine
spaces with Minkowski measures. In International Semantic Web Conference (1),
Boston, USA, pp. 378–393, 2012.
21. A.-C. Ngonga Ngomo. On link discovery using a hybrid approach. J. Data
Semantics, 1(4):203–217, 2012.
22. A.-C. Ngonga Ngomo. A time-efficient hybrid approach to link discovery. In
Sixth International Ontology Matching Workshop, Bonn, Germany, 2011.
23. A.-C. Ngonga Ngomo and S. Auer. LIMES—A time-efficient approach for
large-scale link discovery on the web of data. In Proceedings of IJCAI, Barcelona,
Catalonia, Spain, 2011.
24. A.-C. Ngonga Ngomo, J. Lehmann, S. Auer, and K. Höffner. RAVEN—Active
learning of link specifications. In Proceedings of OM@ISWC, Bonn, Germany,
2011.
25. ontoprise. SMW+—Semantic Enterprise Wiki, 2012. http://www.smwplus.com.
26. E. Prud’hommeaux. SPARQL 1.1 Federation Extensions, November 2011.
http://www.w3.org/TR/sparql11-federated-query/.
27. F. Schwagereit, A. Scherp, and S. Staab. Representing distributed groups with
dgFOAF. The Semantic Web: Research and Applications, pp. 181–195, 2010.
28. B. Settles. Active learning literature survey. Technical Report 1648, University of
Wisconsin-Madison, 2009.
29. D. Song and J. Heflin. Automatically generating data linkages using a
domain-independent candidate selection approach. In ISWC, Boston, USA,
pp. 649–664, 2011.
30. M. Sporny, T. Inkster, H. Story, B. Harbulot, and R. Bachmann-Gmür. WebID 1.0:
Web Identification and Discovery. W3C Editors Draft, December 2011.
http://www.w3.org/2005/Incubator/webid/spec/.
31. H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part
1: Structures (Second Edition). W3C, 2004.
32. L. M. Vaquero, L. Rodero-Merino, and R. Buyya. Dynamically scaling
applications in the cloud. SIGCOMM Comput. Commun. Rev., 41:45–52.
33. H. Wache, T. Voegele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann,
and S. Hübner. Ontology-based integration of information—A survey of
existing approaches. IJCAI-01 Workshop: Ontologies and Information Sharing,
2001:108–117, 2001.
34. C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate
detection. In WWW, Beijing, China, pp. 131–140, 2008.
6
Scalable End-User Access to Big Data

Martin Giese, Diego Calvanese, Peter Haase, Ian Horrocks,


Yannis Ioannidis, Herald Kllapi, Manolis Koubarakis, Maurizio Lenzerini,
Ralf Möller, Mariano Rodriguez Muro, Özgür Özçep, Riccardo Rosati,
Rudolf Schlatte, Michael Schmidt, Ahmet Soylu, and Arild Waaler

Contents
Data Access Problem of Big Data   206
Ontology-Based Data Access   207
   Example   209
   Limitations of the State of the Art in OBDA   212
Query Formulation Support   215
Ontology and Mapping Management   219
Query Transformation   222
Time and Streams   225
Distributed Query Execution   231
Conclusion   235
References   236

This chapter proposes steps toward the solution to the data access problem that
end-users typically face when dealing with Big Data:

• They need to pose ad hoc queries to a collection of data sources, possibly
including streaming sources.
• They are unable to query these sources on their own, but are dependent
on assistance from IT experts.
• The turnaround time for information requests is in the range of
days, possibly weeks, due to the involvement of the IT personnel.
• The volume, complexity, variety, and velocity of the underlying data
sources put very high demands on the scalability of the solution.

We propose to approach this problem using ontology-based data access
(OBDA), the idea being to capture end-user conceptualizations in an ontology
and use declarative mappings to connect the ontology to the underlying
data sources. End-user queries are posed in terms of the concepts of the
ontology and are then rewritten as queries against the sources.
The chapter is structured as follows. First, in the “Data Access Problem
of Big Data” section, we situate the problem within the more general discussion
about Big Data. Then, in the “Ontology-Based Data Access” section,
we review the state of the art in OBDA, explain why we believe OBDA is a
superior approach to the data access challenge posed by Big Data, and also
explain why the field of OBDA is not yet sufficiently mature to deal
satisfactorily with these problems. The rest of the chapter contains concepts
for raising OBDA to a level where it can be successfully deployed to Big Data.
The ideas proposed in this chapter are investigated and implemented
in the FP7 Integrated Project Optique (Scalable End-user Access to Big Data),
which runs until the end of 2016. The Optique solutions are evaluated on
two comprehensive use cases from the energy sector with a variety of data
access challenges related to Big Data.*

Data Access Problem of Big Data


The situation in knowledge- and data-intensive enterprises is typically as
follows. Massive amounts of data, accumulated in real time and over decades,
are spread over a wide variety of formats and sources. End-users operate
on these collections of data using specialized applications, the operation
of which requires expert skills and domain knowledge. Relevant data are
extracted from the data sources using predefined queries that are built into
the applications. Moreover, these queries typically access just some specific
sources with identical structure. The situation can be illustrated like this:

[Diagram: in the simple case, an engineer uses an application that issues predefined queries against uniform sources.]
In these situations, the turnaround time, by which we mean the time from
when the end-user delivers an information need until the data are there, will
typically be in the range of minutes, maybe even seconds, and Big Data
technologies can be deployed to dramatically reduce the execution time for queries.
Situations where users need to explore the data using ad hoc queries are
considerably more challenging, since accessing relevant parts of the data
typically requires in-depth knowledge of the domain and of the organization
of data repositories. It is very rare that the end-users possess such skills
themselves. The situation is rather that the end-user needs to collaborate
* See http://www.optique-project.eu/
with an IT-skilled person in order to jointly develop the query that solves the
problem at hand, illustrated in the figure below:

[Diagram: in the complex case, the engineer hands an information need to an IT expert, who translates it into a specialized query over disparate sources.]

The turnaround time is then mostly dependent on human factors and is in


the range of days, if not worse. Note that the typical Big Data technologies
are of limited help in this case, as they do not in themselves eliminate the
need for the IT expert.
The problem of end-user data access is ultimately about being able to put
the enterprise data in the hands of the expert end-users. Important aspects of
the problem are volume, variety, velocity, and complexity (Beyer et al., 2011),
where by volume we mean the complete size of the data, by variety we mean
the number of different data types and data sources, by velocity we mean
the rate at which data streams in and how fast it needs to be processed, and
by complexity we mean factors such as standards, domain rules, and size of
database schemas that in normal circumstances are manageable, but quickly
complicate data access considerably when they escalate.
Factors such as variety, velocity, and complexity can make data access
challenging even with fairly small amounts of data. When, in addition to these
factors, data volumes are extreme, the problem becomes seemingly intractable;
one must then not only deal with large data sets, but at the same time
also cope with dimensions that to some extent are complementary. In
Big Data scenarios, one or more of these dimensions go to the extreme, at the
same time interacting with the other dimensions.
Based on the ideas presented in this chapter, the Optique project implements
a solution to the data access problem for Big Data in which all the
above-mentioned dimensions of the problem are addressed. The goal is to
enable expert end-users to access the data themselves, without the help of the
IT experts, as illustrated in this figure.

[Diagram: with the Optique solution, the engineer poses flexible, ontology-based queries through an application; a query translation component rewrites them into queries over the disparate sources.]

Ontology-Based Data Access


We have observed that, in end-user access to Big Data, there exists a
bottleneck in the process of translating end-users’ information needs into
executable and optimized queries over data sources. An approach known
as “ontology-based data access” (OBDA) has the potential to avoid this
bottleneck by automating this query translation process. Figure 6.1 shows the
essential components in an OBDA setup.

[Figure 6.1 depicts the end-user posing queries through an application; the queries are answered using the ontology and the mappings maintained by the IT expert, and the results are returned to the application.]

Figure 6.1
The basic setup for OBDA.
The main idea is to use an ontology, or domain model, that is a formalization
of the vocabulary employed by the end-users to talk about the problem
domain. This ontology is constructed entirely independently of how the data
are actually stored. End-users can formulate queries using the terms defined
by the ontology, using some formal query language. In other words, queries
are formulated according to the end-users’ view of the problem domain.
To execute such queries, a set of mappings is maintained that describes the
relationship between the terms in the ontology and their representation(s) in
the data sources. This set of mappings is typically produced by the IT expert,
who previously translated end-users’ queries manually.
It is now possible to give an algorithm that takes an end-user query, the
ontology, and a set of mappings as inputs, and computes a query that can be
executed over the data sources, which produces the set of results expected
for the end-user query. As Figure 6.1 illustrates, the result set can then be
fed into some existing domain-specific visualization or browsing application
which presents it to the end-user.
In the next section, we will see an example of such a query translation
process, which illustrates the point that including additional information about
the problem domain in the ontology can be very useful for end-users. In
general, this process of query translation becomes much more complex than just
substituting pieces of queries using the mappings. The generated query can
also, in some cases, become dramatically larger than the original
ontology-based query formulated by the end-user.
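A toy illustration of why the rewritten query grows: expanding one ontology concept by its subclasses and substituting each mapping yields a union of source queries. All names here (concepts, tables, columns) are invented for illustration; real OBDA rewriting also handles joins, object properties, and far more expressive ontology axioms.

```python
# Illustrative subclass axioms: TurbineFault ⊑ Fault, SensorFault ⊑ Fault
SUBCLASSES = {"Fault": ["TurbineFault", "SensorFault"]}

# Mappings from ontology concepts to source queries (hypothetical schemas)
MAPPINGS = {
    "TurbineFault": "SELECT event_id FROM turbine_events WHERE kind = 'fault'",
    "SensorFault":  "SELECT event_id FROM sensor_log WHERE status = 'FAIL'",
}

def rewrite(concept):
    """Rewrite a single-concept query into a UNION of source queries by
    expanding the concept with its (transitive) subclasses."""
    todo, seen, parts = [concept], set(), []
    while todo:
        c = todo.pop()
        if c in seen:
            continue
        seen.add(c)
        if c in MAPPINGS:
            parts.append(MAPPINGS[c])
        todo.extend(SUBCLASSES.get(c, []))
    return "\nUNION\n".join(parts)

sql = rewrite("Fault")   # a two-branch UNION for a one-atom end-user query
```

A single atom over "Fault" already expands into one branch per mapped subclass; with deeper hierarchies and multi-atom queries, this expansion multiplies, which is why optimizing the generated queries matters.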
The theoretical foundations of OBDA have been thoroughly investigated
in recent years (Möller et al., 2006; Calvanese et al., 2007a,b; Poggi et al.,
2008). There is a very good understanding of the basic mechanisms for
query rewriting, and of the extent to which the expressivity of ontologies can be
increased while maintaining the same theoretical complexity as is exhibited
by standard relational database systems.
Also, prototypical implementations exist (Acciarri et al., 2005; Calvanese
et al., 2011), which have been applied to minor industrial case studies (e.g.,
Amoroso et al., 2008). They have demonstrated the conceptual viability of the
OBDA approach for industrial purposes.
There are several features of a successful OBDA implementation that lead
us to believe that it is the right basic approach to the challenges of end-user
access to Big Data:

• It is declarative, that is, there is no need for end-users, nor for IT
experts, to write special-purpose program code.
• Data can be left in existing relational databases. In many cases, moving
large and complex data sets is impractical, even if the data owners
were to allow it. Moreover, for scalability it is essential to exploit
existing optimized data structures (tables), and to avoid increasing query
complexity by fragmenting data. This is in contrast to, for example,
data warehousing approaches that copy data: OBDA is more flexible
and offers an infrastructure which is simpler to set up and maintain.
• It provides a flexible query language that corresponds to the
end-user conceptualization of the data.
• The ontology can be used to hide details and introduce abstractions.
This is significant in cases where there is a source schema which is
too complex for the end-user.
• The relationship between the ontology concepts and the relational
data is made explicit in the mappings. This provides a means for the
DB experts to make their knowledge available to the end-user
independent of specific queries.

Example
We will now present a (highly) simplified example that illustrates some of the
benefits of OBDA and explains how the technique works. Imagine that an
engineer working in the power generation industry wants to retrieve data about
generators that have a turbine fault. The engineer is able to formalize this
information need, possibly with the aid of a suitable tool, as a que