NoSQL Systems For Big Data Management | PDF | No Sql | Scalability
0% found this document useful (0 votes)
124 views

NoSQL Systems For Big Data Management

no sql database

Uploaded by

HenryStephen
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
124 views

NoSQL Systems For Big Data Management

no sql database

Uploaded by

HenryStephen
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

2014 IEEE 10th World Congress on Services

NoSQL Systems for Big Data Management

Venkat N Gudivada
Dhana Rao
Vijay V. Raghavan
Weisburg Division of Computer Science Biological Sciences Department Center for Advanced Computer Studies
Marshall University
Marshall University
University of Louisiana at Lafayette
Huntington, WV, USA
Huntington, WV, USA
Lafayette, LA, USA
gudivada@marshall.edu
raod@marshall.edu
vijay@cacs.louisiana.edu

NoSQL databases require developer-centric approach


from the application inception to completion. For example, data modeling is done by the application architect
or developer, whereas in RDBMS based applications,
data architects and data modelers complete the data
modeling tasks. They begin by constructing conceptual
and logical data models. Database transactions and
queries come into play only in the design of a physical database. In contrast, NoSQL database approaches
begin by identifying application queries and structure
the data model to eciently support these queries. In
other words, there is a strong coupling between the
data model and application queries. Any changes to the
queries will necessitate changes to the data model. This
approach is in stark contrast to the time-tested RDBMS
principles of logical and physical data independence.
Traditionally, data is viewed as a strategic and shared
corporate resource with well established policies and
procedures for data governance and quality control. In
contrast, NoSQL systems promote data silos, each silo
geared towards meeting the performance and scalability
requirements of just one or more applications. This
runs against the ethos of enterprise data integration,
redundancy control, and integrity constraints [5].
Given the above backdrop, following are the contributions of this paper. We provide a taxonomy and unied
perspective to help the reader understand the functionality, strengths and limitations of NoSQL databases.
Using this perspective, we compare and contrast various NoSQL systems using multiple facets which include
system architecture, data model, query language, client
API, scalability, and availability. We group the current
NoSQL systems into seven broad categories: Key-Value,
Table-type/Columnar, Document, Graph, Native XML,
Native Object, and Hybrid databases. Members of these
categories are not always disjoint, with some systems
falling into more than one category. We also describe
application scenarios for each category to help the
reader choose an appropriate NoSQL system for a given
application. We conclude the paper by indicating future
research direction.
We begin with a discussion of the principal features

AbstractThe advent of Big Data created a need for


out-of-the-box horizontal scalability for data management
systems. This ushered in an array of choices for Big
Data management under the umbrella term NoSQL. In this
paper, we provide a taxonomy and unied perspective
on NoSQL systems. Using this perspective, we compare
and contrast various NoSQL systems using multiple facets
including system architecture, data model, query language,
client API, scalability, and availability. We group current
NoSQL systems into seven broad categories: Key-Value,
Table-type/Column, Document, Graph, Native XML, Native
Object, and Hybrid databases. We also describe application
scenarios for each category to help the reader in choosing
an appropriate NoSQL system for a given application. We
conclude the paper by indicating future research directions.
Keywords-Data Models, NoSQL, NewSQL, Big Data, Graph
Databases, Document Databases, Native XML Databases

I. Introduction
Until recently, relational database management systems (RDBMS) were the mainstay for managing all types
of data irrespective of their naturally t to the relational
data model. The emergence of Big Data and mobile
computing necessitated new database functionality to
support applications such as real-time logle analysis,
cross-selling in eCommerce, location based services,
and micro-blogging. Many of these applications exhibit
a preponderance of insert and retrieve operations on a
very large scale. Relational databases were found to be
inadequate in handling the scale and varied structure
of data.
The above requirements ushered in an array of
choices for Big Data management under the umbrella
term NoSQL [1], [2], [3]. NoSQL (meaning not only
SQL) has come to describe a large class of databases
which do not have properties of traditional relational
databases and are generally not queried with SQL
(structured query language). NoSQL systems provide
data partitioning and replication as built-in features.
They typically run on cluster computers made from
commodity hardware and provide horizontal scalability.
Developing applications using NoSQL systems is
quite dierent from the process used with RDBMS [4].
978-1-4799-5069-0/14 $31.00 2014 IEEE
DOI 10.1109/SERVICES.2014.42

206
190

subject matter expert (SME) thinks about the data in


the domain in the context of a specic application. In
other words, a NoSQL application works with a data
model that is specically designed and optimized for
that application.

of NoSQL systems (section II). In section III, we describe


concepts that are essential to understanding NoSQL
systems from both a technical standpoint and business
perspective. Sections IV through X describe the classes
of NoSQL systems along with their application scenarios. Section XI concludes the paper.

C. Eventual Consistency

II. Principal Features of NoSQL


The need for NoSQL database solutions primarily
emanated from the requirements of eCommerce, mobile
computing and location based services, Web services,
and social media applications. eCommerce applications
require predictive analytics, personalization, dynamic
pricing, superior customer service, fraud and anomaly
detection, real-time order status through supply chain
visibility, and Web server access log analysis. All of
the above functions are characterized by massive data
which need to be fused from multiple sources and
processed in real-time.

Create/Insert, Read, Update, and Delete (CRUD) are


four operations that databases are expected to provide
and execute eciently. NoSQL data models are structured to support massively concurrent insert and read
operations relative to updates. The notion of eventual
consistency is embraced to support extremely fast insert
and read operations. Eventual consistency implies that
applications are aware of the non-repeatable read issue
due to latency in consistency. Some NoSQL systems
employ data models to eciently support only insert
and read to the exclusion of update and delete.

A. Big Data Characteristics

D. Partial Record Updates

Data that is too big and complex to capture, store,


process, analyze, and interpret using the state-of-theart tools and methods is referred to as Big Data. It
is characterized by ve Vs: Volume, Velocity, Variety,
Value, and Veracity. Data size (Volume) is in the order
of terabytes or even petabytes, and is rapidly heading
towards exabytes. Velocity refers to the speed at which
the data is generated. Enterprise Big Data is typically
heterogeneous (Variety) and is comprised of structured,
semi-structured, and unstructured data. Value refers
to extracting useful and actionable information from
massive data sets. Due to diversity in sources and evolution of data through intermediate processing, issues
of security, privacy, trust, and accountability need to be
addressed using secure data provenance (Veracity).

RDBMS data models are optimized for row-wise processing based on the assumption that applications
process most of an entitys attributes. In contrast,
some NoSQL data models are designed to eciently
perform column-wise processing to support computing
aggregates on one or two attributes across all entities.
For example, data models of column-oriented NoSQL
databases resemble a data warehouse star schema
where the fact table at the center stores de-normalized
row data and each dimension table stores all columns
of a column-family.
E. Application Synthesized Relationships
RDBMS data models are designed to capture one-toone, one-to-many, and many-to-many relationships between entities. In contrast, many NoSQL data models do
not explicitly and inherently model relationships (graph
data model based NoSQL systems are an exception).
Most NoSQL systems do not support relational join
operation. NoSQL applications are expected to synthesize relationships by processing the data. If the data
is inherently rich in relationships, many NoSQL data
models are a poor t.

B. Flexible Data Models


All RDBMS databases are based on the relational data
model. In contrast, each class of NoSQL system employs
a dierent data model. NoSQL data models inherently
support horizontal data partitioning across processors
in a cluster or even processors across data centers.
RDBMS applications typically work with xed
database schema, though the schema may evolve
slowly. Schema change is a well managed activity and
the application goes through rigorous software testing.
In contrast, many NoSQL systems treat schema as a construct that inherently evolves over time. These systems
also require schema that allows great variation in row
data without incurring the NULL value problems.
A NoSQL data model closely reects an individual
application need rather than needs of several applications. The data model naturally mirrors the way a

F. Optimized MapReduce Processing


In RDBMS applications, employing MapReduce processing requires moving the data from the RDBMS to
another memory address space within the same machine or to a dierent one. In contrast, some NoSQL
systems feature MapReduce as a native functionality to
obviate the data movement.

191
207

However, vertical scaling quickly reaches its saturation point. Other techniques for improving transaction
throughput include relaxing ACID compliance and using
methods such as disabling database logging.
Horizontal scalability, in contrast, refers to scaling
by adding new processors complete with their own
disks (aka nodes). Closely associated with horizontal
scaling is the partitioning of data. Each node/processor
contains only a subset of the data. The process of
assigning data to each node is referred to as sharding,
which is done automatically in some NoSQL systems.
It is easier to achieve horizontal scalability with
NoSQL databases. The complexity involved in enforcing ACID compliance simply does not exist for most
NoSQL systems as they are ACID non-compliant by
design. Furthermore, only insert and read operations
dominate in NoSQL databases as update and delete
operations fade into insignicance in terms of volume.
NoSQL systems also achieve horizontal scalability by
delegating two-phase commit required for transaction
implementation to applications. For example, MongoDB
does not guarantee ACID compliance for concurrent
operations. This may infrequently result in undesirable
outcomes such as phantom reads.

G. Native Support for Versioning


Mainstream, o-the-shelf RDBMS do not provide support for data versioning. However, an RDBMS data
model can be designed to support versioning of an
attribute by including two additional attributes (begin
date and end date). This is a convoluted approach with
strong performance penalties. In contrast, some NoSQL
data models are designed to provide native support for
versioning.
H. New Query Languages
RDBMS SQL queries in practice tend to be complex,
involve joins across multiple tables, consume substantial database resources to execute, and are expensive
to write, debug, and maintain. In contrast, many NoSQL
systems avoid SQL altogether. Some provide a SQL-like
query language such as XQuery, while others require
writing programs for querying the database.
NoSQL databases take a multi-faceted approach
to querying including allowing only simple queries
through interfaces such as REST API, caching data in
main memory, or using database stored procedures as
queries. For example, in document-oriented NoSQL systems, a database query is a JSON document (technically,
a JavaScript expression) specied across all the documents in the collection. For graph model based NoSQL
systems, queries involve traversing graph structures.
Interfaces for graph traversal include JavaScript-based
interface (JIG), and special query languages such as
SPARQL, RDFS++, and Cypher.

III. NoSQL Concepts


NoSQL databases draw upon several concepts and
techniques to realize exible data modeling and associated query languages, and horizontal scalability. These
concepts include but are not limited to shared-nothing
architecture, Hash trees, consistent hashing, REST API,
Protocol buers (Protobuf), Apache Thrift, JSON and
BSON, BASE, MapReduce, vector clocks, column family,
keyspace, memory-mapped les, and the CAP theorem.

I. Horizontal Scalability
RDBMS for Big Data applications are forced to use
distributed storage. Reasons for this include data that
exceeds the disk size limits, need for partitioned table
storage, recovery from hard disk failures through data
redundancy, and data replication to enable high availability. A typical RDBMS query requires join operations
across multiple tables. How fast a query is processed
is limited by how quickly data can be moved from
hard disks to primary storage, and the amount of data
moved. Database update operations are usually run as
database transactions. If the table data is partitioned,
update operations run slowly due to contention for
exclusive locks.
Securing and releasing write locks, coordinating this
across various disks, and ACID (atomicity, consistency,
isolation, and durability) compliance slows down transaction throughput. Long running transactions exacerbate the transaction throughput further. Vertical scaling
is often proposed as a solution to the transaction
throughput problem, which involves adding more CPUs,
more cores per CPU, and additional main memory.

A. Shared-Nothing Architecture
In this architecture, each node is self-sucient and
acts independently to remove single point of resource
contention. Nodes share neither memory nor disk storage. Database systems based on shared-nothing architecture can scale almost linearly by adding new
nodes assembled using inexpensive commodity hardware components. Data is distributed across the nodes
in a non-overlapping manner, which is called sharding.
Though this concept existed for long with its roots in
distributed databases, it gained prominence with the
advent of NoSQL systems.
B. Hash Trees and Consistent Hashing
A hash tree (aka Merkle tree) is a tree in which every
non-leaf node is labeled with the hash of the labels of
its child nodes. Hash trees enable ecient and secure
verication of data transmitted between computers for

192
208

map processes concurrently processing dierent parts


of data and each process produces (key, value) pairs.
The shard process (second step) acts as a barrier to
ensure that all mapper processes have completed their
work, collects the generated key-value pairs from each
mapper process, sorts them, and partitions the sorted
key-value pairs. Next, the shard process assigns each
partition to a dierent reduce process (third step). Each
reduce process essentially receives all related key-value
pairs and produces one result.

veracity. They are also used to eciently determine the


dierences between a document and its copy.
In traditional hashing, a change in the number of
slots in the hash table results in nearly all keys
remapped to new slots. If K is the number of keys and
n is the number of slots, consistent hashing guarantees
that on average no more than K/n keys are remapped
to new slots.
C. REST API, Protobuf and Apache Thrift
Representational State Transfer (REST) is an HTTP
API. It uses four HTTP methods GET, POST, PUT and
DELETE to execute dierent operations. Many NoSQL
systems enable client interaction through this API.
Protocol buers (Protobuf) is a method for eciently
serializing and transmitting structured data. It is used
to perform remote procedure calls (RPC) as well as for
storing data. Thrift is a software framework and an
interface denition language (IDL) for cross-language
services and API development.

G. Vector Clocks
A vector clock is an algorithm to reason about events
based on event timestamps. It is an extension of multiversion concurrency control (MVCC) used in RDBMS to
multiple servers. Each server keeps its copy of vector
clock. When servers send and receive messages among
themselves, vector clocks are incremented and attached
with messages. A partial order relationship is dened
based on server vector clocks, and is used to derive
causal relationships between database item updates.

D. JSON and BSON

H. Column Families and Keyspaces

JavaScript Object Notation (JSON) is a lightweight,


text-based, open standard format for exchanging data
between a server and a Web application. Though it
is originally derived from the JavaScript language, it
is a language-neutral data format. BSON is a format
for binary-coded serialization of JSON-like documents.
Compared to Protobuf, BSON is more exible with
schema but not as space ecient.

A column family is a collection of rows, and each row


can contain dierent number of columns. Row keys are
unique within a column family, but can be reused in
other column families. This enables storing unrelated
data about the same key in dierent column families.
Some NoSQL systems use the term column-family to
refer to a group of related columns within a row.
A keyspace is a data container. It is similar to the
schema concept in RDBMS. Keyspaces are used to group
column families together. Data replication is specied
at the keyspace level. Therefore, data with dierent
replication requirements reside in separate keyspaces.

E. BASE Properties
In RDBMS, the consistency property ensures that all
transactions transform a database from one valid state
to another. Once a transaction updates a database item,
all database clients (i.e., programs and users) will see
the same value for the updated item.
ACID properties are to RDBMS as BASE is to NoSQL
systems. BASE refers to basic availability, soft state, and
eventual consistency. Basic availability implies disconnected client operation and delayed synchronization,
and tolerance to temporary inconsistency and its implications. Soft state refers to state change without input,
which is required for eventual consistency. Eventual
consistency means that if no further updates are made
to an updated database item for long enough period of
time, all clients will see the same value for the updated
item.

I. Memory-Mapped Files
A memory-mapped le is a segment of virtual memory which is associated with an operating system le
or le-like resource (e.g., a device, shared memory) on
a byte-for-byte correspondence. Memory-mapped les
increase I/O performance especially for large les.
J. CAP Theorem
Consistency, availability, and partition tolerance
(CAP) are the three primary concerns that determine
which data management system is suitable for a given
application. Consistency feature guarantees that all
clients of a data management system have the same
view of data. Availability assures that all clients can always read and write. Finally, partition tolerance ensures
that the system works well with data that is distributed
across physical network partitions. The CAP theorem
states that it is impossible for any data management

F. Minimizing Data Movement with MapReduce


MapReduce is a computational paradigm for processing massive datasets in parallel if the computation
ts a pattern characterized by three steps: map, shard
and reduce. The map process involves several parallel

193
209

record access, and high speed insert and read operations. When a value changes, it is stored as a dierent
version of the same value using a timestamp. In other
words, the notion of update is eectively nonexistent.
Partial record access contributes to dramatic performance improvements for certain applications. Columnar databases perform aggregate operations such as
computing maxima, minima, average, and sum on large
datasets with extreme eciency.
Recall that a column family is a set of related
columns. Column databases require predening column families, and not columns. A column family may
contain any number of columns of any type of data,
as long as the latter can be persisted as byte arrays.
Columns in a family are logically related to each other,
and are physically stored together. Performance gain
is achieved by grouping columns with similar access
characteristics into the same family. Database schema
evolution is achieved by adding columns to column
families. A column family is similar to the column
concept in RDBMS.
Systems in this category include Google BigTable
(available through Google App Engine), Apache Cassandra, Apache HBase, Hypertable, Cloudata, Oracle RDBMS
Columnar Expression, Microsoft SQL Server 2012 Enterprise Edition. Table II summarizes characteristics of
some columnar databases.

system to achieve all these three features at the same


time. For example, to achieve partition tolerance, a
system may need to give up consistency or availability.
IV. Key-Value Databases
The dening characteristics of key-value databases
include real-time processing of Big Data, horizontal
scalability across nodes in a cluster or data centers,
reliability and high availability. Their use cases include
applications where response is needed in milliseconds.
They are used for session management for Web applications; conguration management; distributed locks;
messaging; personalization of user experience and providing engaging user interactions in social media, mobile platforms, and Internet gaming; real-time bidding
and online trading; Ad servers; recommendation engines; and multi-channel retailing and eCommerce.
As the name suggests, these systems store data as
key-value pairs or maps (aka dictionaries). Though all
of them store data as maps, they dier widely in functionality and performance. Some systems store data
ordered on the key, while others do not. Some keep
entire data in memory and while others persist data to
disk. Data can also be distributed across a cluster of
nodes.
Representative systems in this category include
Memcached, Aerospike, Redis, Riak, Kyoto Cabinet,
Membase, Amazon DynamoDB, CouchDB, BerkeleyDB,
EHCache, Apache Cassandra [6], and Voldermot. Table I summarizes characteristics of some key-value
databases.
A hallmark of key-value database is rapid applications development and extremely fast response times
even with commodity type processors. For example,
100K - 200K simple write/read operations have been
achieved with an Intel Core 2 Duo, 2.6 GHz processor.
In another case, replacement of a RDBMS with Redis for
user prole management (100K+ proles with over 10
lter attributes) was reduced from 3 - 6 sec to 50 ms.

VI. Graph Databases


A graph data model is at the heart of graph databases
[7]. In some applications, relationships between objects
is even more important than the objects themselves.
Relationships can be static or dynamic. Such data is
called connected data. Twitter, Facebook, Google, and
LinkedIn data are naturally modeled using graphs. In
addition, graph data models are used in other industries including airlines, freight companies, healthcare,
retail, gaming, oil and gas. Graph databases are also
popular for implementing access control and authorization subsystems for applications that serve millions of
end users.
Graph databases include FlockDB, InniteGraph, Titan, Microsoft Trinity, HyperGraphDB, AllegroGraph,
Anity, OrientDB, DEX, Facebook Open Graph, Google
Knowledge Graph, and Neo4J. Table III summarizes
characteristics of two graph databases.

V. Table-type/Column Databases
RDBMS are row-based systems as their processing
is row-centric. They are designed to eciently return
entire rows of data. Rows are uniquely identied by
system generated row ids. As the name implies, column
databases are column-centric. Conceptually, a columnar
database is like an RDBMS with an index on every
column without incurring the overhead associated with
the latter. It is also useful to think of column databases
as nested key-value systems.
Column database applications are characterized by
tolerance to temporary inconsistency, need for versioning, exible database schema, sparse data, partial

VII. Document Databases


These databases are not the document/full-text
databases in the traditional sense. They are also not
content management systems. Document databases are
used to manage semi-structured data mostly in the
form of key-value pairs packaged as JSON documents.

194
210

Table I
Key-Value Databases
Name

Salient Characteristics

Memcached

Shared-nothing architecture, in-memory object caching systems with no disk persistence. Automatic sharding but no
replication. Client libraries for popular programming languages including Java, .Net, PHP, Python, and Ruby.

Aerospike

Shared-nothing architecture, in-memory database with disk persistence. Automatic data partitioning and synchronous
replication. Data structures support for string, integer, BLOB, map, and list. ACID with relax option, backup and recovery,
high availability. Cross data center replication. Client access through Java, Lua, and REST.

Cassandra

Shared-nothing, master-master architecture, in-memory database with disk persistence. Key range based automatics data
partitioning. Synchronous and asynchronous replication across multiple data centers. High availability. Client interfaces
include Cassandra Query Language (CQL), Thrift, and MapReduce. Largest known Cassandra cluster has over 300 TB of
data in over 400-node cluster.

Redis

Shared-nothing architecture, in-memory database with disk persistence, ACID transactions. Supports several data
structures including sets, sorted sets, hashes, strings, and blocking queues. Backup and recovery. High availability.
Client interface through C and Lua.

Riak

Shared-nothing architecture, in-memory database with disk persistence, data teated as BLOBs, automatic data partitioning,
eventually consistency, backup and recovery, and high availability through multi data center replication. Client API
includes Erlang, JavaScript, MapReduce queries, full text search, and REST.

Voldemort

Shared-nothing architecture, in-memory database with disk persistence, automatic data partitioning and replication,
versioning, map and list data structures, ACID with relax option, backup and recovery, high availability. Protocol Buers,
Thrift, Avro and Java serialization options. Client access through Java API.

Table II
Column Databases
Name

Salient Characteristics

BigTable

A sparse, persistent, distributed, multi-dimensional sorted map. Features strict consistency and runs on distributed
commodity clusters. Ideal for data that is in the order of billions of rows with millions of columns.

HBase

Open Source Java implementation of BigTable. No data types, and everything is a byte array. Client access tools: shell
(i.e., command line), Java API, Thrift, REST, and Avo (binary protocol). Row keys are typically 64-bytes long. Rows are
byte-ordered by their row keys. Uses distributed deployment model. Works with Hadoop Distributed File System (HDFS),
but uses lesystem API to avoid strong coupling. HBase can also be used with CloudStore.

Cassandra

Provides eventual consistency. Client interfaces: phpcasa (a PHP wrapper), Pycassa (Python binding), command line/shell,
Thrift, and Cassandra Query Language (CQL). Popular for developing nancial services applications.

Table III
Graph Databases
Name

Salient Characteristics

Neo4J

In-memory or in-memory with persistence. Full support for transactions. Nodes/vertices in the graph are described using
properties and the relationships between nodes are typed and relationships can have their own properties. Deployed on
compute clusters in a single data center or across multiple geographically distributed data centers. Highly scalable and
existing applications have 32 billion nodes, 32 billion relationships, and 64 billion properties. Client interfaces: REST,
Cypher (SQL-like), Java, and Gremlin.

AllegroGraph

Full read concurrency, near full write concurrency, and dynamic and automatic indexing of committed data, soundex
support, ne granular security, geospatial and temporal reasoning, and social network analysis. Online backups, point-intime recovery, replication, and warm standby. Integration with Solr and MongoDB. Client interfaces: JavaScript, CLIF++,
and REST (Java Sesame, Java Jena, Python, C#, Clojure, Perl, Ruby, Scala, and Lisp).

195
211

and db4o is an open source system. Versant Object


Database targets applications with high performance
and concurrency requirements. Client access is provided through interfaces for Java, .NET and C++. Versant JPA is an object database with JPA 2.0 compliant
client interface. FastObjects primarily provides objectbased persistence for .NET applications.

Each document is an independent entity with potentially varied and nested attributes. Documents are indexed by their primary identiers as well as semistructured document eld values. Document databases
are ideal for applications that involve aggregates across
document collections.
Systems in this category include MongoDB, CouchDB,
Couchbase, RavenDB, and FatDB. Document databases
often integrate with full-text databases such as Solr,
Lucene, and ElasticSearch. For example, ElasticSearch
provides real-time response to document queries in
JSON format; RavenDB uses Lucene. Table IV summarizes characteristics of some document databases.

X. Hybrid Systems
Systems in this category are those that evolved
from the traditional RDBMS or those that fall into
more than one category discussed above. Systems in
this category include PostgreSQL, VoltDB, OrientDB,
Aerospike, ArangoDB, and Spanner. On Amazon EC2,
VoltDB achieved 95 thousand transactions per second
(TPS) in a Node.js application running on 12 nodes;
and 34 million TPS with 30 nodes. Table V summarizes
characteristics of some hybrid databases.

VIII. Native XML Databases


These databases store documents in native XML format. Query languages supported include XQuery, XSLT,
and XPath. Some provide support for JSON. Systems
in this category include BaseX, eXist, MarkLogic Server,
and Sedna.
The distinguishing feature of MarkLogic Server is
security. For this reason, it is widely used in security
industry. It provides client access through REST and
Java and JSON is supported. REST API is partitioned into
three categories: client API for CRUD and search; management API for instrumentation; and packaging API for
conguration. MarkLogic Server recently rebranded as
Enterprise NoSQL.

XI. Conclusions
Unprecedented data volumes, connected data, performance and scalability requirements of modern Big datadriven applications eectively challenged the practice
that RDBMS is the only approach for data management.
Consistency, availability, and partition tolerance are
the three primary concerns that determine which data
management system is suitable for a given application.
Netix moved from Oracle RDBMS to Apache Cassandra, and achieved over a million writes per second
across the cluster with over 10,000 writes per second
per node while maintaining the average latency at less
than 0.015 milliseconds. Total cost of Cassandra set
up and running on Amazon EC2 was at around $60 per
hour for a cluster of 48 nodes.
Cisco recently replaced an Oracle RAC solution for
master data management with Neo4J. Query times were
reduced from minutes to milliseconds in addition to
expressing queries on connected data with ease. This
application has 35 million nodes, 50 million relationships, and 600 million properties.
As the above two cases exemplify, NoSQL data models are designed for eciently supporting insert and
read operations; storing sparse data and column-wise
processing; enabling disconnected client operation and
delayed synchronization; and tolerance to temporary
inconsistency and its implications.
A more typical scenario for NoSQL systems to gain
ubiquitous use will come from using an array of data
management systems, each naturally suited for the type
of data and operations, all data management systems
abstracted away through a Web or application server.
Furthermore, Web servers will manage access control
and authorization centrally. Also, the migration from

IX. Native Object Databases


Object databases, as the name implies, use object
models to store data. These systems combine database
functionality with object-oriented programming language capabilities. They allow creating and modifying
objects within the database system. Some systems are
tightly integrated with an object-oriented programming
language, while others feature database-specic programming language. This eliminated the impedance
mismatch problem of RDBMS since the same data
model is used by the programming language and the
database. A subclass within object databases is called
object-relational. These systems combine traditional
RDBMS capabilities with object databases. Native object
databases are ideally suited for complex data often
found in computer-aided design, manufacturing, spatial, graphics, and multimedia applications.
Query languages for object databases are typically
declarative. Object database queries typically run much
faster since no RDBMS joins are involved. Object Query
Language (OQL) is the result of standardization eort
spearheaded by the Object Data Management Group.
Current systems in this category include Versant
Object Database, Versant JPA, FastObjects, db4o, and
WakandaDB. The rst three are marketed by Versant

196
212

Table IV
Document Databases
Name

Salient Characteristics

MongoDB

No transaction support. Only modier operations oer atomic consistency. Lack of isolation levels may result in phantom
reads. Uses memory-mapped les storage. Support is available for geospatial processing and MapReduce framework.
Indexing, replication, GridFS, and aggregation pipeline. JavaScript expressions as queries. Client access tools: JS Shell
(command line tool), and drivers for most programming languages. Suitable for applications that require auto-sharding,
high horizontal scalability for managing schema-less semi-structured documents. Stores documents in BSON format and
data is transferred across the wire in binary format.

CouchDB

Open Source database written in Erlang. JSON format for documents. Client access tools: REST API, CouchApps (an
application server), and MapReduce. JavaScript is used for writing MapReduce functions.

Couchbase

Incorporates functionality of CouchDB and Membase. Data is automatically partitioned across cluster nodes. All nodes
can do both reads and writes. Used in many commercial high availability applications and games.
Table V
Hybrid Databases

Name

Salient Characteristics

PostgreSQL

Recent versions provide support for JSON and key-value support through an add-on called hstore. Suitable for applications
that primarily depend on RDBMS for data management, but also face a need for scalable means for managing key-value
data.

VoltDB

In-memory database running on a single thread. Eliminates the overheads associated with locking and latching in multithreaded environments. Uses snapshots to save data to the disk. Database can be restored to a previous state using
snapshots. Data is distributed across several servers. Supports transactions. Supports only a subset of ANSI/ISO SQL.
Migration to VoltDB will require rewriting some of the existing SQL queries. Client interfaces: JSON API, Java, C++, C#,
PHP, Python, Node.js, Ruby, and Erlang.

VoltCache

A key-value database implemented on top of VoltDB. Client access is through a Memcached compatible API.

OrientDB

Provides document, key-value, and graph databases functionality.

ArangoDB

Provides document, key-value, and graph databases functionality.

Aerospike

Hybrid system like OrientDB, but also provides traditional RDBMS functionality.

Google Spanner

Multi-version, globally-distributed, and synchronously-replicated database. Spanner supports externally-consistent distributed transactions.

RDBMS to NoSQL systems is eased as the latter return


results in JSON format.
Though NoSQL systems are predominantly used for
new applications which are characterized by horizontal
scalability, high performance, relaxed and eventual consistency, it is also likely that existing applications will
begin to use NoSQL through reegineering process. The
current upheaval in the data management systems will
help promote using the system that closely matches
the application needs. Services such as Amazon EC2
will make NoSQL systems more economical and within
reach for all organizations, both small and big.

booktitle = Proceedings of the 8th ACM European Conference on Computer Systems, series = EuroSys 13, year =
2013, isbn = 978-1-4503-1994-2, location = Prague, Czech
Republic, pages = 183196, numpages = 14, publisher =
ACM, address = New York, NY, USA.
[4] D. McCreary and A. Kelly, Making Sense of NoSQL: A guide
for managers and the rest of us. Manning Publications,
2013.
[5] C. Mohan, History repeats itself: Sensible and NonsenSQL
aspects of the NoSQL hoopla, in Proceedings of the 16th
International Conference on Extending Database Technology, ser. EDBT 13. New York, NY, USA: ACM, 2013, pp.
1116.

References

[6] V. Abramova and J. Bernardino, NoSQL databases: MongoDB vs Cassandra, in Proceedings of the International
C* Conference on Computer Science and Software Engineering, ser. C3S2E 13. New York, NY, USA: ACM, 2013,
pp. 1422.

[1] R. Cattell, Scalable sql and nosql data stores, SIGMOD


Rec., vol. 39, no. 4, pp. 1227, May 2011.
[2] V. Benzaken, G. Castagna, K. Nguyen, and J. Simon,
Static and dynamic semantics of NoSQL languages,
SIGPLAN Not., vol. 48, no. 1, pp. 101114, Jan. 2013.

[7] I. Robinson, J. Webber, and E. Eifrem, Graph Databases.


OReilly, 2013.

[3] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. a. Paulo, J. Pereira,


and R. Vilaa, MeT: Workload aware elasticity for NoSQL,

197
213

You might also like