NDBI040: Modern Database Concepts
http://[Link].mff.[Link]/~svoboda/courses/191-NDBI040/
Lecture 1
Introduction
Martin Svoboda
[email protected]ff.[Link]
1. 10. 2019
Charles University, Faculty of Mathematics and Physics
Lecture Outline
Big Data
• Characteristics
• Current trends
NoSQL databases
• Motivation
• Features
Overview of NoSQL database types
• Key-value, wide column, document, graph, …
NDBI040: Modern Database Concepts | Lecture 1: Introduction | 1. 10. 2019
What is Big Data?
No standard definition
• Gartner (research and advisory company):
High Performance Computing
Big Data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization.
Where is Big Data?
Sources of Big Data
• Social media and networks
…all of us are generating data
• Scientific instruments
…collecting all sorts of data
• Mobile devices
…tracking all objects all the time
• Sensor technology and networks
…measuring all kinds of data
Big Data Characteristics
Figure: Volume (Scale). Source: http://[Link]/
Figure: Variety (Complexity). Source: http://[Link]/
Figure: Velocity (Speed). Source: http://[Link]/
Figure: Veracity (Uncertainty). Source: http://[Link]/
Basic 4V
• Volume (Scale)
Data volume is increasing exponentially, not linearly
Even large amounts of small data can result in Big Data
• Variety (Complexity)
Various formats, types, and structures
(from semi-structured XML to unstructured multimedia)
• Velocity (Speed)
Data is being generated fast and needs to be processed fast
• Veracity (Uncertainty)
Uncertainty due to inconsistency, incompleteness, latency,
ambiguities, or approximations
Relational Databases
Data model
Instance → database → table → row
Query languages
• Real-world: SQL (Structured Query Language)
• Formal: Relational algebra, relational calculi (domain, tuple)
Query patterns
• Selection based on complex conditions, projection, joins,
aggregation, derivation of new values, recursive queries, …
Representatives
• Oracle Database, Microsoft SQL Server, IBM DB2
• MySQL, PostgreSQL
Relational Databases
Features: Normal Forms
Model
• Functional dependencies
• 1NF, 2NF, 3NF, BCNF (Boyce-Codd normal form)
Objective
• Normalization of database schema to BCNF or 3NF
• Algorithms: decomposition or synthesis
Motivation
• Diminish data redundancy, prevent update anomalies
• However:
Data is scattered into small pieces (high granularity), and so
these pieces have to be joined back together when querying!
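The trade-off above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are hypothetical and serve only as an illustration: redundancy is avoided by storing customer data once, at the cost of a join at query time.

```python
import sqlite3

# Hypothetical normalized schema: customer data is stored once,
# orders refer to it by key instead of repeating the name and city.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         item TEXT);
    INSERT INTO customers VALUES (1, 'Alice', 'Prague');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")

# The city is stored in exactly one row, so changing it cannot leave
# stale copies behind (no update anomaly).
conn.execute("UPDATE customers SET city = 'Brno' WHERE id = 1")

# The price: the pieces must be joined back together when querying.
rows = conn.execute("""
    SELECT o.item, c.name, c.city
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```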
Relational Databases
Features: Transactions
Model
• Transaction = flat sequence of database operations
(READ, WRITE, COMMIT, ABORT)
Objectives
• Enforcement of ACID properties
• Efficient parallel / concurrent execution (slow hard drives, …)
ACID properties
• Atomicity – partial execution is not allowed (all or nothing)
• Consistency – transactions turn one valid database state into another
• Isolation – uncommitted effects are concealed among transactions
• Durability – effects of committed transactions are permanent
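Atomicity and consistency can be observed in a small sketch, again with Python's sqlite3 module. The accounts table, the CHECK constraint, and the transfer helper are assumptions made for this example, not part of the lecture. A transfer that would violate the constraint is rolled back as a whole, including its first, already executed update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts; return True if committed."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the credit to dst is undone as well.
        return False

ok = transfer(conn, "a", "b", 30)
failed = transfer(conn, "a", "b", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```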
Current Trends
Big Data
• Volume: terabytes → zettabytes
• Variety: structured → structured and unstructured data
• Velocity: batch processing → streaming data
• …
Big users
• Population online, hours spent online, devices online, …
• Rapidly growing companies / web applications
Even millions of users within a few months
Current Trends
Everything is in the cloud
• SaaS: Software as a Service
• PaaS: Platform as a Service
• IaaS: Infrastructure as a Service
Processing paradigms
• OLTP: Online Transaction Processing
• OLAP: Online Analytical Processing
• …but also…
• RTAP: Real-Time Analytical Processing
Current Trends
Data assumptions
• Data format is becoming unknown or inconsistent
• Linear growth → unpredictable exponential growth
• Read requests often prevail over write requests
• Data updates are no longer frequent
• Data is expected to be replaced
• Strong consistency is no longer mission-critical
Current Trends
⇒ A new approach is required
• Relational databases simply do not follow the current trends
Key technologies
• Distributed file systems
• MapReduce and other programming models
• Grid computing, cloud computing
• NoSQL databases
• Data warehouses
• Large scale machine learning
NoSQL Databases
What does NoSQL actually mean?
A bit of history …
• 1998
First used for a relational database that omitted the use of SQL
• 2009
First used during a conference to advocate non-relational
databases
So?
• Not: no to SQL
• Not: not only SQL
• NoSQL is an accidental term with no precise definition
NoSQL Databases
What does NoSQL actually mean?
NoSQL movement = The whole point of seeking alternatives
is that you need to solve a problem that relational databases
are a bad fit for
NoSQL databases = Next generation databases mostly
addressing some of the points: being non-relational,
distributed, open-source and horizontally scalable. The original
intention has been modern web-scale databases. Often more
characteristics apply as: schema-free, easy replication
support, simple API, eventually consistent, a huge data amount,
and more.
Source: http://[Link]/
Types of NoSQL Databases
Core types
• Key-value stores
• Wide column (column family, column oriented, …) stores
• Document stores
• Graph databases
Non-core types
• Object databases
• Native XML databases
• RDF stores
• …
Key-Value Stores
Data model
• The simplest NoSQL database type
Works as a simple hash table (mapping)
• Key-value pairs
Key (id, identifier, primary key)
Value: binary object, black box for the database system
Query patterns
• Create, update or remove value for a given key
• Get value for a given key
Characteristics
• Simple model ⇒ great performance, easily scaled, …
• Simple model ⇒ not for complex queries nor complex data
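The query patterns above map directly onto a hash table. The following toy in-memory class is a sketch only; the class name, method names, and the sample key are hypothetical and do not correspond to the API of any particular product.

```python
from typing import Dict, Optional

class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Create or update: the store never inspects the value.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Lookup by key is the only way to reach a value.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", b'{"user": "alice", "cart": ["book"]}')
value = store.get("session:42")
store.delete("session:42")
missing = store.get("session:42")
```

Because the value is a black box, a question such as "which sessions contain a book in the cart" cannot be answered by the store itself; the application would have to fetch and decode every value.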
Key-Value Stores
Suitable use cases
• Session data, user profiles, user preferences, shopping carts, …
I.e. when values are only accessed via keys
When not to use
• Relationships among entities
• Queries requiring access to the content of the value part
• Set operations involving multiple key-value pairs
Representatives
• Redis, MemcachedDB, Riak KV, Hazelcast, Ehcache, Amazon
SimpleDB, Berkeley DB, Oracle NoSQL, Infinispan, LevelDB,
Ignite, Project Voldemort
• Multi-model: OrientDB, ArangoDB
Document Stores
Data model
• Documents
Self-describing
Hierarchical tree structures (JSON, XML, …)
– Scalar values, maps, lists, sets, nested documents, …
Identified by a unique identifier (key, …)
• Documents are organized into collections
Query patterns
• Create, update or remove a document
• Retrieve documents according to complex query conditions
Observation
• Extended key-value stores where the value part is examinable!
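What "examinable value part" means can be sketched in plain Python. The posts collection, its fields, and the get_path and find helpers are hypothetical; real document stores offer their own query languages, which this sketch does not imitate.

```python
from typing import Any, Callable, Dict, List

# A collection of self-describing documents; the fields are made up
# and need not be identical across documents.
posts: List[Dict[str, Any]] = [
    {"_id": 1, "title": "Intro", "tags": ["nosql"], "author": {"name": "Alice"}},
    {"_id": 2, "title": "Joins", "tags": ["sql"], "author": {"name": "Bob"}},
    {"_id": 3, "title": "Graphs", "tags": ["nosql", "graph"]},
]

def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'author.name'; None if absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(collection: List[Dict[str, Any]], path: str,
         predicate: Callable[[Any], bool]) -> List[Dict[str, Any]]:
    """Return documents whose value at `path` satisfies `predicate`."""
    return [d for d in collection if predicate(get_path(d, path))]

nosql_ids = [d["_id"] for d in
             find(posts, "tags", lambda t: t is not None and "nosql" in t)]
by_alice = [d["_id"] for d in
            find(posts, "author.name", lambda n: n == "Alice")]
```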
Document Stores
Suitable use cases
• Event logging, content management systems, blogs, web
analytics, e-commerce applications, …
I.e. for structured documents with similar schema
When not to use
• Set operations involving multiple documents
• Design of document structure is constantly changing
I.e. when the required level of granularity would outweigh
the advantages of aggregates
Document Stores
Representatives
• MongoDB, Couchbase, Amazon DynamoDB, CouchDB,
RethinkDB, RavenDB, Terrastore
• Multi-model: MarkLogic, OrientDB, OpenLink Virtuoso,
ArangoDB
Wide Column Stores
Data model
• Column family (table)
Table is a collection of similar rows (not necessarily identical)
• Row
Row is a collection of columns
– Should encompass a group of data that is accessed together
Associated with a unique row key
• Column
Column consists of a column name and column value
(and possibly other metadata records)
Scalar values, but also flat sets, lists or maps may be allowed
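The data model above can be pictured as a nested map: column family → row key → column name → column value. The sketch below is a toy model under that assumption; the users family, its columns, and the helper functions are hypothetical, and metadata such as timestamps is left out.

```python
from typing import Dict

# column family -> row key -> column name -> column value
ColumnFamily = Dict[str, Dict[str, str]]

users: ColumnFamily = {}

def put(cf: ColumnFamily, row_key: str, column: str, value: str) -> None:
    """Set one column of a row, creating the row if needed."""
    cf.setdefault(row_key, {})[column] = value

def get_row(cf: ColumnFamily, row_key: str) -> Dict[str, str]:
    """Return all columns stored for the row key (empty if unknown)."""
    return dict(cf.get(row_key, {}))

put(users, "alice", "email", "alice@example.org")
put(users, "alice", "city", "Prague")
put(users, "bob", "email", "bob@example.org")  # this row has no 'city' column

alice = get_row(users, "alice")
bob_columns = sorted(get_row(users, "bob"))
```

Rows of the same family are similar but not identical: a missing column is simply not stored, rather than held as a NULL.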
Wide Column Stores
Query patterns
• Create, update or remove a row within a given column family
• Select rows according to a row key or simple conditions
Warning
• Wide column stores are not just a special kind of RDBMSs
with a variable set of columns!
Wide Column Stores
Suitable use cases
• Event logging, content management systems, blogs, …
I.e. for structured flat data with similar schema
When not to use
• ACID transactions are required
• Complex queries: aggregation (SUM, AVG, …), joining, …
• Early prototypes: i.e. when database design may change
Representatives
• Apache Cassandra, Apache HBase, Apache Accumulo,
Hypertable, Google Bigtable
Graph Databases
Data model
• Property graphs
Directed / undirected graphs, i.e. collections of …
– nodes (vertices) for real-world entities, and
– relationships (edges) between these nodes
Both the nodes and relationships can be associated
with additional properties
Types of databases
• Non-transactional = small number of very large graphs
• Transactional = large number of small graphs
Graph Databases
Query patterns
• Create, update or remove a node / relationship in a graph
• Graph algorithms (shortest paths, spanning trees, …)
• General graph traversals
• Sub-graph queries or super-graph queries
• Similarity based queries (approximate matching)
Representatives
• Neo4j, Titan, Apache Giraph, InfiniteGraph, FlockDB
• Multi-model: OrientDB, OpenLink Virtuoso, ArangoDB
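Two of the query patterns listed on this slide, creating nodes and relationships and running a shortest-path algorithm, can be sketched on a toy in-memory property graph. All names (add_node, add_relationship, shortest_path, the KNOWS type, the sample people) are hypothetical; graph databases expose these operations through their own query languages and APIs.

```python
from collections import deque
from typing import Dict, List, Optional

# Toy property graph: nodes and directed relationships both carry properties.
nodes: Dict[str, dict] = {}
edges: Dict[str, List[dict]] = {}

def add_node(node_id: str, **props) -> None:
    nodes[node_id] = props
    edges.setdefault(node_id, [])

def add_relationship(src: str, dst: str, rel_type: str, **props) -> None:
    edges[src].append({"to": dst, "type": rel_type, "props": props})

def shortest_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search: fewest relationships from start to goal."""
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path: List[str] = []
            step: Optional[str] = current
            while step is not None:
                path.append(step)
                step = previous[step]
            return path[::-1]
        for edge in edges[current]:
            if edge["to"] not in previous:
                previous[edge["to"]] = current
                queue.append(edge["to"])
    return None  # goal is not reachable from start

for name in ("alice", "bob", "carol", "dave"):
    add_node(name, label="Person")
add_relationship("alice", "bob", "KNOWS", since=2015)
add_relationship("bob", "carol", "KNOWS", since=2017)
add_relationship("alice", "dave", "KNOWS", since=2018)
add_relationship("dave", "carol", "KNOWS", since=2019)

path = shortest_path("alice", "carol")
no_path = shortest_path("carol", "alice")
```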
Graph Databases
Suitable use cases
• Social networks, routing, dispatch, and location-based
services, recommendation engines, chemical compounds,
biological pathways, linguistic trees, …
I.e. simply for graph structures
When not to use
• Extensive batch operations are required
Multiple nodes / relationships are to be affected
• Graphs are too large to be stored
Graph distribution is difficult or even impossible
Native XML Databases
Data model
• XML documents
Tree structure with nested elements, attributes, and text values
(besides other less important constructs)
Documents are organized into collections
Query languages
• XPath: XML Path Language (navigation)
• XQuery: XML Query Language (querying)
• XSLT: XSL Transformations (transformation)
Representatives
• Sedna, Tamino, BaseX, eXist-db
• Multi-model: MarkLogic, OpenLink Virtuoso
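XPath-style navigation can be tried out with Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset. The library document below is hypothetical; a native XML database would evaluate full XPath or XQuery expressions over stored collections instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: nested elements, attributes, and text values.
doc = ET.fromstring("""
<library>
  <book year="2012"><title>First</title></book>
  <book year="1999"><title>Second</title></book>
</library>
""")

# Path navigation: all title elements nested in book elements.
titles = [t.text for t in doc.findall("./book/title")]

# Attribute predicate: only books whose year attribute equals 2012.
from_2012 = [b.find("title").text
             for b in doc.findall("./book[@year='2012']")]
```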
RDF Stores
Data model
• RDF triples
Components: subject, predicate, and object
Each triple represents a statement about a real-world entity
• Triples can be viewed as graphs
Vertices for subjects and objects
Edges directly correspond to individual statements
Query language
• SPARQL: SPARQL Protocol and RDF Query Language
Representatives
• Apache Jena, rdf4j (Sesame), Algebraix
• Multi-model: MarkLogic, OpenLink Virtuoso
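The triple model can be sketched with plain tuples. The ex: identifiers and the match helper are made up for illustration. Matching a single triple pattern, with None standing in for a variable, conveys the basic idea behind SPARQL patterns, but it is not SPARQL and omits joins across several patterns.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements using made-up ex: identifiers.
triples: List[Triple] = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:name", "Bob"),
]

def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

names = [obj for (_, _, obj) in match(p="ex:name")]
about_alice = match(s="ex:alice")
```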
Features of NoSQL Databases
Data model
• Traditional approach: relational model
• (New) possibilities:
Key-value, document, wide column, graph
Object, XML, RDF, …
• Goal
Respect the real-world nature of data
(i.e. data structure and mutual relationships)
Features of NoSQL Databases
Aggregate structure
• Aggregate definition
Data unit with a complex structure
Collection of related data pieces we wish to treat as a unit
(with respect to data manipulation and data consistency)
• Examples
Value part of key-value pairs in key-value stores
Document in document stores
Row of a column family in wide column stores
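As an illustration of an aggregate, the sketch below shows one hypothetical order held as a single nested unit, next to the same data split into flat rows as an aggregate-ignorant relational layout would store it. The field names and values are invented for the example.

```python
# Hypothetical order aggregate: everything needed to process one order
# travels together and is read or written as a single unit.
order_aggregate = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Alice"},
    "items": [
        {"product": "book", "price": 12, "qty": 2},
        {"product": "pen", "price": 3, "qty": 1},
    ],
}

# The same data spread over several flat tables linked by keys.
orders = [(1001, 7)]
customers = [(7, "Alice")]
order_items = [(1001, "book", 12, 2), (1001, "pen", 3, 1)]

# With the aggregate, the order total needs one unit of data only.
total_from_aggregate = sum(i["price"] * i["qty"]
                           for i in order_aggregate["items"])

# With flat rows, the matching pieces have to be collected first.
total_from_rows = sum(price * qty
                      for (oid, _, price, qty) in order_items
                      if oid == 1001)
```

Where to draw the boundary (for instance, whether customer details belong inside the order) depends on how the data is accessed.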
Features of NoSQL Databases
Aggregate structure
• Types of systems
Aggregate-ignorant: relational, graph
– It is not a bad thing, it is a feature
Aggregate-oriented: key-value, document, wide column
• Design notes
No universal strategy for drawing aggregate boundaries
Atomicity of database operations:
just a single aggregate at a time
Features of NoSQL Databases
Elastic scaling
• Traditional approach: scaling up (vertical scaling)
Buying bigger servers as database load increases
• New approach: scaling out (horizontal scaling)
Distributing database data across multiple hosts
– Graph databases (unfortunately): difficult or even impossible
Data distribution
• Sharding
Particular ways in which database data is split into separate groups
• Replication
Maintaining several data copies (performance, recovery)
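One possible sharding and replication scheme is sketched below: the primary host is chosen by hashing the key, and each extra copy goes to the next host in the list. This is an assumption made for illustration, not the strategy of any specific system; the host names, the REPLICAS constant, and the helper functions are hypothetical.

```python
import hashlib
from typing import Dict, List

HOSTS = ["host-0", "host-1", "host-2"]  # hypothetical host names
REPLICAS = 2                             # copies kept of every key

def placement(key: str) -> List[str]:
    """Pick the primary host by hashing the key, then the next host(s)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % len(HOSTS)
    return [HOSTS[(primary + i) % len(HOSTS)] for i in range(REPLICAS)]

storage: Dict[str, Dict[str, str]] = {h: {} for h in HOSTS}

def put(key: str, value: str) -> None:
    # Replication: the value is written to every host in the placement.
    for host in placement(key):
        storage[host][key] = value

def get(key: str) -> str:
    # Any replica could answer; this sketch always asks the primary.
    return storage[placement(key)[0]][key]

for k in ("user:1", "user:2", "user:3", "user:4"):
    put(k, k.upper())
copies = sum(len(d) for d in storage.values())
```

Simple modulo placement remaps most keys whenever the host list changes, which is one reason real systems tend to use more elaborate partitioning schemes.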
Features of NoSQL Databases
Automated processes
• Traditional approach
Expensive and highly trained database administrators
• New approach: automatic recovery, distribution, tuning, …
Relaxed consistency
• Traditional approach
Strong consistency (ACID properties and transactions)
• New approach
Eventual consistency only (BASE properties)
I.e. we have to make trade-offs because of the data distribution
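Eventual consistency can be simulated in a few lines. In this sketch, a write is acknowledged after reaching one replica and is delivered to the others later; conflicts are resolved by keeping the highest version (last-writer-wins). The Replica class, the pending queue, and the explicit version numbers are simplifying assumptions, not a description of how any particular database replicates data.

```python
from typing import Dict, List, Tuple

class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, Tuple[int, str]] = {}  # key -> (version, value)

    def apply(self, key: str, version: int, value: str) -> None:
        # Last-writer-wins: keep the entry with the highest version.
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)

    def read(self, key: str) -> str:
        return self.data[key][1]

replicas = [Replica(), Replica()]
# Queued updates: (replica index, key, version, value)
pending: List[Tuple[int, str, int, str]] = []

def write(key: str, version: int, value: str) -> None:
    """Update one replica immediately; queue the update for the rest."""
    replicas[0].apply(key, version, value)
    for i in range(1, len(replicas)):
        pending.append((i, key, version, value))

def sync() -> None:
    """Deliver all queued updates (stands in for background replication)."""
    while pending:
        i, key, version, value = pending.pop(0)
        replicas[i].apply(key, version, value)

write("cart", 1, "book")
sync()
write("cart", 2, "book, pen")
stale = replicas[1].read("cart")  # the old value is still visible here
sync()
fresh = replicas[1].read("cart")  # the replicas have converged
```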
Features of NoSQL Databases
Schemalessness
• Relational databases
Database schema present and strictly enforced
• NoSQL databases
Relaxed schema or completely missing
Consequences: higher flexibility
– Dealing with non-uniform data
– Structural changes cause no overhead
However: there is (usually) an implicit schema
– We must know the data structure at the application level
anyway
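The implicit schema can be made visible with a short sketch. The users collection and the primary_email helper are hypothetical: the store accepts all three document shapes, yet the application code still has to know each of them.

```python
from typing import Any, Dict, List

# Non-uniform documents stored side by side; no schema is enforced.
users: List[Dict[str, Any]] = [
    {"name": "Alice", "email": "alice@example.org"},
    {"name": "Bob", "emails": ["bob@example.org", "b@example.org"]},
    {"name": "Carol"},
]

def primary_email(user: Dict[str, Any]) -> str:
    """The implicit schema lives here: every known shape is handled."""
    if "email" in user:
        return user["email"]
    if user.get("emails"):
        return user["emails"][0]
    return ""  # no address recorded for this user

emails = [primary_email(u) for u in users]
```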
Features of NoSQL Databases
Open source
• Often community and enterprise versions (with extended
features or extent of support)
Simple APIs
• Often stateless application interfaces (HTTP)
Features of NoSQL Databases
Current State: Five advantages
• Scaling
Horizontal distribution of data among hosts
• Volume
High volumes of data that cannot be handled by RDBMS
• Administrators
No longer needed because of the automated maintenance
• Economics
Usage of cheap commodity servers, lower overall costs
• Flexibility
Relaxed or missing data schema, easier design changes
Features of NoSQL Databases
Current State: Five challenges
• Maturity
Often still in pre-production phase with key features missing
• Support
Mostly open source, limited sources of credibility
• Administration
Sometimes relatively difficult to install and maintain
• Analytics
Missing support for business intelligence and ad-hoc querying
• Expertise
Still a low number of NoSQL experts available on the market
Conclusion
The end of relational databases?
• Certainly not
They are still suitable for most projects
Familiarity, stability, feature set, available support, …
• However, we should also consider different database models
and systems
Polyglot persistence = usage of different data stores
in different circumstances
Lecture Conclusion
Big Data
• 4V characteristics: volume, variety, velocity, veracity
NoSQL databases
• (New) logical models
Core: key-value, wide column, document, graph
Non-core: XML, RDF, …
• (New) principles and features
Horizontal scaling, data sharding and replication, eventual
consistency, …
Course Overview
Outline and Objectives
Principles
• Scaling, distribution, consistency
• Transactions, visualization, …
Technologies
• MapReduce programming model
Apache Hadoop
• Data formats
XML, JSON, RDF, …
• NoSQL databases
Core: RiakKV, Redis, MongoDB, Cassandra, Neo4j
Non-core: XML, RDF
Data models, query languages, …