0% found this document useful (0 votes)
4 views72 pages

No SQL

NoSQL refers to non-relational database management systems that do not follow conventional RDBMS structures, allowing for more flexibility in handling large-scale and sparse data. It encompasses various types, including key-value stores, document stores, and column-oriented databases, each with unique querying methods and data models. The rise of NoSQL has been driven by the challenges faced by traditional RDBMS in accommodating the demands of modern web-scale applications.

Uploaded by

pulokr221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views72 pages

No SQL

NoSQL refers to non-relational database management systems that do not follow conventional RDBMS structures, allowing for more flexibility in handling large-scale and sparse data. It encompasses various types, including key-value stores, document stores, and column-oriented databases, each with unique querying methods and data models. The rise of NoSQL has been driven by the challenges faced by traditional RDBMS in accommodating the demands of modern web-scale applications.

Uploaded by

pulokr221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

NoSQL

NoSQL
 NoSQL means?
 No + SQL?
 No relational DBMS
 NonRel, suggested alternative name
 Not only SQL
 Try to keep the relation with the RDMS
 Now, NoSQL stands for
 Db or data storage
 Do not follow conventional RDBMS

2 Advanced Database system principle


History
 Long long ago, there is . . .
 With the first batch of computers
 Hierarchical catalog:
 needs storage authorization and authentication
 Flourished in the era of
 Parallelized computation (DBMS)
 Distributed computation (DBMS)
 Reborn in Web scale applications
 AliBaba, FB, Weibo

3 Advanced Database system principle


Challenges to RDBMS (1)
 Do you know the first search engine?
 Archie: 1990
 McGill University, 3 students
 Search a file by filename
 First one to index internet ftp files
 Still not a real search engine
 Veronica (Gopher)
 Gopher, information search system
 By Nevada System Computing Services University, 1993
 Might be the first search engine
 Inktomi
 Could be seen as the second one
 Yahoo later use this search engine

4 Advanced Database system principle


Challenges to RDBMS (2)
 RDBMS is challenged a lot!
 From Inktomi to Google
 RDBMS suppose
 data attributes are well defined
 in a good structure
 has relations among data (entity)
 index could be built to speedup queries
 allows some irregular or missing structure
 However, is hard to handle large-scale, sparse data

5 Advanced Database system principle


Challenges to RDBMS (3)
 To cope with challenges, RDBMS could
 Denormalization
 Remove constraints
 Relax requirements on transactions
 But, after that, RDBMS is more like a NoSQL

6 Advanced Database system principle


NoSQL & Hadoop
 NoSQL is proposed, which could
 Reduce the difficulty to handle big, sparse data
 But, without the ability:
 guarantee transaction integrity
 flexible index & query

 The most missing property of NoSQL is


SQL !!!
 Google built and published many works
 The first open source software is released
 By the founder of Lucene - open source search engine
 2003, core group of Lucene joined Yahoo!
 Then, Hadoop is born! Core person Doug Cutting
7 Advanced Database system principle
Content
 Types of NoSQL DB
 Column-oriented DB
 K-value Store
 Document DBs
 Map-reduce

8 Advanced Database system principle


Types & Produces of NoSQL DB

 Key-value

 Graph database

 Document-oriented

 Column-oriented

9 Advanced Database system principle


NoSQL databases key-value

Name Producer Data model Querying

SimpleDB Amazon set of couples (key, {attribute}), restricted SQL; select, delete,
where attribute is a couple GetAttributes, and
(name, value) PutAttributes operations
Redis Salvatore set of couples (key, value), primitive operations for each
Sanfilippo where value is simple typed value type
value, list, ordered (according
to ranking) or unordered set,
hash value
Dynamo Amazon like SimpleDB simple get operation and put
in a context
Voldemort LinkeId like SimpleDB similar to Dynamo

10 Advanced Database system principle


NoSQL databases Column-oriented
Name Producer Data model Querying

BigTable Google set of couples (key, {value}) selection (by combination of


(in doubt) row, column, and time stamp
ranges)
HBase Apache groups of columns (a BigTable JRUBY IRB-based shell
clone) (similar to SQL)
Hypertable Hypertable like BigTable HQL (Hypertext Query
Language)
CASSANDRA Apache columns, groups of columns simple selections on key,
(originally corresponding to a key range queries, column or
Facebook) (supercolumns) columns ranges
PNUTS Yahoo (hashed or ordered) tables, selection and projection from a
typed arrays, flexible schema single table (retrieve an
arbitrary single record by
primary key, range queries,
complex predicates, ordering,
top-k)
11 Advanced Database system principle
NoSQL databases document-based

Name Producer Data model Querying

MongoDB 10gen object-structured manipulations with objects in


documents stored in collections (find object or
collections; objects via simple selections
each object has a primary and logical expressions,
key called ObjectId delete, update,)
Couchbase Couchbase1 document as a list of by key and key range, views
named (structured) items via Javascript and
(JSON document) MapReduce

1after merging Membase and CouchOne

12 Advanced Database system principle


Content
 Types of NoSQL DB
 Column-oriented DB
 K-value Store
 Document DBs
 Map-reduce

13 Advanced Database system principle


Column-oriented DB
 A table like this: (example from wiki)

EmpId Lastname Firstname Salary


1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000

14 Advanced Database system principle


Column-oriented DB
 Row-oriented DB (called row-stores)
 stores by row, as
 1,Smith,Joe,40000;
 2,Jones,Mary,50000;
 3,Johnson,Cathy,44000;

 Column-oriented DB (column-stores)
 Stores by column, as
 1,2,3;
 Smith,Jones,Johnson;
 Joe,Mary,Cathy;
 40000,50000,44000;
15 Advanced Database system principle
Column-oriented DB
 Advantage, or fits for
 Aggregates by values of certain columns
 Frequently modify some columns
 Can cope with temporal data
 Do not need to define schema in advance

 Column store cares more about disk storage


 But not the upper level data model (column family)

16 Advanced Database system principle


Column-oriented DB
 Frequently modify some columns
 A table in RDBMS like

EmpId name Salary


1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000

 Having millions of records, then split


name -> first name, last name
 Then have to alter table structure, which is unacceptable

17 Advanced Database system principle


Column-oriented DB
 Can cope with temporal data

Lastname Firstname street zip habit address

 Consider different timestamp

 Hard to handle in row-stores

 Easy to handle in column-stores


 Organized in column, with a time attribute

18 Advanced Database system principle


Column-oriented DB
 Do not need to define schema in advance

Lastname Firstname street zip habit address

 Row-stores needs to pre-define schema

 Column-stores do not
 Pre-defines column family
 A column family contains any # of columns
 Example can be:
 3 Column family: {name}, {address}, {preferences}

19 Advanced Database system principle


Column-oriented DB
 Friendly to sparse, mutated data
 Only valid column value can be stored
 Null will not be stored

 Design is a scalable one


 Single table is stored at multiple sites
 could be easily extended to have millions, billions of columns

20 Advanced Database system principle


Column-oriented DB
 With column family, column, timestamp, a record:
{
“row_key_i”:{
“name”:{
“firstname”: {
1:“Mary”,
5: “Angel”
}
}

}
21 Advanced Database system principle
}
Column-oriented DB
 Compression
 join

22 Advanced Database system principle


Benefits of Compression
 Reduces the size of Data

 Improve I/O Performance


 Reduce seek time:
data are stored nearer to each other.
 Reduce transfer time:
as there is less data
 Increase buffer hit rate:
buffer can hold larger fraction of data

23 Advanced Database system principle


High Compressibility
 Each attribute is stored in a separate column.

 Not only use traditional compression techniques


 Dictionary encoding, Huffman Encoding, etc

 But also use column-oriented techniques


 Run-Length Encoding

24 Advanced Database system principle


Run-Length Encoding
 Principles:
 A scanned row, compress same pixel with the color and its counter.
 Example:
 aaabccccccddeee, is compressed to
 3a1b6c2d3e
 Good for compressing images
with large, simple colored area
 Various algorithms are proposed based on this

25 Advanced Database system principle


How to Query compressed Column?
 De-compress the data

 Query the compressed column directly?


 For Run-Length Encoding

26 Advanced Database system principle


A Simple Example
 In column C1, the value “42” appears 1000 times
consecutively.

 Assume C1 is compressed by Run-Length Encoding

 A simple query: SUM(C1), can be directly got


 ==42 * 1000
 Match a key
 Extract its occurence

27 Advanced Database system principle


Compression-Aware Optimization
 Natural Join
 One input column is compressed by RLE
 Another input column is uncompressed
 Do the join directly
 Reduce the number of operations by a factor of k, where k is the
run-length of the RLE triple.
 Count

28 Advanced Database system principle


29 Advanced Database system principle
Content
 Types of NoSQL DB
 Column-oriented DB
 K-value Store
 Document DBs
 Map-reduce

30 Advanced Database system principle


Key-Value Store
 Key-Value Database
 queries faster
 store large volumes of data
 support high-concurrency
 very suitable for query by primary key
 Complex query conditions
Real-Time Search Engine

31 Advanced Database system principle


Amazon dynamo

32 Advanced Database system principle


Oracle NoSQL Database
 distributed key/value pair storage

 data is stored as key-value pairs


 which are written to particular storage nodes
 based on the hashed value of the primary key

33 Advanced Database system principle


Oracle NoSQL Database

34 Advanced Database system principle


Key-value Store
 A collection of Storage Nodes which host a set of Replication
Nodes.
 Given a traditional three-tier web architecture
 Takes the place of the back-end database
 Runs alongside it

35 Advanced Database system principle


Key-value Store

36 Advanced Database system principle


Content
 Types of NoSQL DB
 Column-oriented DB
 K-value Store
 Document DBs
 Map-reduce

37 Advanced Database system principle


An example of Document store
 To create a db for location-based services
 Create db prefs
 Create collection: location
 Codes:
use prefs
w = {name: “John Doe”, zip: 10001};
x = {name: “Don Doe”, zip: 10001};
db.location.save(w);
db.location.save(x);

38 Advanced Database system principle


An example of Document store
 use prefs : make prefs as the current db
 But db is not created yet
 Collection is never created
 Only after db.location.save(w);

 Query all records:


 Code: db.location.find();
 results:
{“_id”:ObjectedId(“45dfaft3daf…24B”), “name”: “John Doe”, “zip”:
10001};

39 Advanced Database system principle
An example of Document store
 Query by zip:
 Code: db.location.find({zip:10001});
 results:
{“_id”:ObjectedId(“45dfaft3daf…24B”), “name”: “John Doe”, “zip”:
10001};
{“_id”:ObjectedId(“4cdfsf33daf…24B”), “name”: “Don Doe”, “zip”:
10001};

40 Advanced Database system principle


An example of Document store
 ObjectedId uniquely identify a document

 Stored in _id

 Both are maintained by developers

41 Advanced Database system principle


Document store
 A MongoDB deployment hosts a number of databases.
 A database holds a set of collections.
 A collection holds a set of documents.
 A document is a set of key-value pairs.
 The document structure in MongoDB are BSON objects with
support for the full range of BSON types

42 Advanced Database system principle


BSON
 BSON: Binary Serialized Document Format
 A kind of storage format
 Like JSON: support document/array object
 Difference: support data type: Date,BinData
 Advantage: schema-less, flexible
 Disadvantage: space waste
 Example: {"hello":"world"}
 Key name: “hello”
 cstring::= (byte*) "/x00"
 *: 0-> n, byte /x00: end of key
43 Advanced Database system principle
Term mapping

44 Advanced Database system principle


Schema design
 MongoDB: embedding and link
 Embedding:
 the nesting of objects and arrays,
 inside a BSON document(prejoined).
 Links:
 references between documents(client-side follow-up query).
 "contains" relationships, one to many; duplication of data, many
to many

45 Advanced Database system principle


Schema design

46 Advanced Database system principle


Schema design

47 Advanced Database system principle


Content
 Types of NoSQL DB
 Column-oriented DB
 K-value Store
 Document DBs
 Map-reduce

48 Advanced Database system principle


Map Reduce
 Input & Output: a set of key/value pairs
Programmer specifies two functions:
 map (in_key, in_value) -> list(out_key, intermediate_value)
 Processes input key/value pair
 Produces set of intermediate pairs
 reduce (out_key, list(intermediate_value)) -> list(out_value)
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)

49 Advanced Database system principle


Example: Count word occurrences
 map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, "1");

 reduce(String output_key, Iterator intermediate_values):


// output_key: a word
// output_values: a list of counts
int result = 0;
for each v in intermediate_values:
result += ParseInt(v);
Emit(AsString(result));

50 Advanced Database system principle


Implementation Environment- Google
 Large clusters of commodity PCs connected together with
switched Ethernet
 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
 Commodity networking hardware: 100 MB/s or 1GB/s.
 Storage is on local IDE disks
 GFS: distributed file system manages data (SOSP'03)
 Job scheduling system: jobs made up of tasks, scheduler assigns tasks
to machines
 Implementation is a C++ library linked into user programs

51 Advanced Database system principle


Distributed Execution Overview
User
Program

fork fork fork

assign Master
assign
map
reduce

Input Data Worker


write Output
local Worker File 0
Split 0 read write
Split 1 Worker
Split 2 Output
Worker File 1
Worker remote
read,
sort
52 Advanced Database system principle
Dynamo:
Amazon’s Highly Available Key-value Store

53 Advanced Database system principle


Existing storage techniques
 A number of
 Famous one is simple DB
 Full name:
 Amazon Simple Storage Service
 Known as Amazon S3
 In Amazon, many applications only need primary-key access
 Best seller list
 Shopping carts
 Customer preferences
 Session management …

54 Advanced Database system principle


Dynamo
 Provides:
 Good scalability, high availability
 Through a synthesis of:
 Partitioning, replicating (consistent hashing)
 Consistency (versioning)
 Consistency (quorum-like protocol)
 Failure detection (gossip based algorithm)
…
 A decentralized one
 Storage nodes can be added, removed without manual
partitioning or redistribution.
55 Advanced Database system principle
System Assumption and Requirements
 Query model:
 Simple read and write to a data item
 Item is uniquely identified by a key
 No operations span multiple data items
 ACID (atomicity, consistency, isolation, durability) properties:
 Weak consistency, as it results in high availability
 Permits only single key updates

56 Advanced Database system principle


Service Level Agreements (SLA)
 To provide functionality in a bounded time:

 Most of systems measuring SLA, in terms of


 Mean, median value

 Amazon performs 99.9% of the distribution:


 To provide a better experience

57 Advanced Database system principle


Design Considerations
 Strong consistency cause low availability:
 Most data replication algorithms synchronize data replicas to
provide strong consistency
 Indicates program have to wait, until a correct answer could be
made to a query
 Therefore, cause low availability

 Dynamo is designed to be eventually consistent, i.e., all


updates reach all replicas eventually.

58 Advanced Database system principle


Design Considerations
 Need to consider:
 When to process update conflicts
 Who to perform the process
 When?
 Conflicts should be resolved during either reads or writes
 If resolve during writes, read is simple, but writes may be
rejected, within a given time
 Dynamo tries to provide “always writeable” data store, process
it during reads
 Shop cart: over 10m request, 3m checkout per day
 Writes are never rejected

59 Advanced Database system principle


Design Considerations
 Who?
 Done by db store or applications
 If done by db store
 Limited choices (simple), such as
 Last write wins
 If done by application
 Can merge different versions of conflicts, return a single unified
result
 Shop cart
 Could have other strategies

60 Advanced Database system principle


Design Considerations
 Incremental scalability
 Scale out one storage node, with minimal impact on the system
 symmetry
 Each node should have the same set of responsibilities as its
peers
 No distinguishable node is allowed
 Easy for maintenance
 Decentralization
 Centralization control may cause outage
 Need a peer-to-peer decentralized control

61 Advanced Database system principle


Other Issues
 All nodes are trusted
 Do not support hierarchical namespaces
 Nor complex schema
 Latency sensitive system
 99.9% reads, writes operations
 Finished within 100 milliseconds
 To meet latency requirement
 Avoid routing requests through multiple nodes
 Multiple hop routing increase response time
 Dynamo is 0-hop, each node maintains enough routing info, route a
request to appropriate node directly.
62 Advanced Database system principle
System architecture
 Should be scalable and robust
 Have following components:
 Load balancing
 Membership and failure detection
 Replica synchronization
 Overload handling
 State transfer
 Concurrency and job scheduling
 Request marshalling
 Request routing
…
63 Advanced Database system principle
System interface
 Get() and put()
 Get(key)
 Locates the object (or replicas) associated with the key in the
storage system
 Returns a list of objects with conflicting versions, along with a
context.
 Context
 Encodes system metadata about the object
 Opaque to caller
 Includes info such as versions
 Stores together with the objects

64 Advanced Database system principle


System interface
 put(key, context, object)
 Determines where the replicas of the object should be placed
 Based on the key value
 Writes the replicas to disk

65 Advanced Database system principle


Partitioning Algorithm
 For incremental scalability
 Need to partition data dynamically
 Consistent hashing:
 to partition data over all available data nodes

66 Advanced Database system principle


Consistent Hashing
 Out range of hash function:
a fixed ring
 i.e., the largest hash value wraps around
the smallest one
 Each node -> a random value, indicate its
position
 Item is hashed (by key)
get a position on the ring
 Then it walks along the ring clockwise, to
find the first node with position larger
than the item’s position

67 Advanced Database system principle


Advantage of Consistent Hashing
 Each node is responsible for the region in the ring between it
and its predecessor node
 Departure or arrival of node
 Only affects its immediate neighbors
 Other nodes remain unaffected
 Challenges:
 The random assigned value, may cause
 non-uniform data, non-uniform load distribution

 Resolution
 mapping a node to multiple points, instead of one
 To this end, use virtual nodes
 A node corresponds to multiple virtual nodes
68 Advanced Database system principle
Strategies adopted by Dynamo
 Strategy 1: T random tokens per node and partition by token
value
 Each node is assigned with T tokens (position in the ring)
 Chose uniformly at random in the ring
 Tokens are ordered by their values in the ring
 Every two consecutive tokens form a range
 The last token and
the first token form
the wrapping range
 A,B,C, 3 unique
preferred nodes for k1
69 Advanced Database system principle
Strategies adopted by Dynamo
 Strategy 2: T random tokens per node and equal-sized
partitions:
 The hash space is divided into Q equally sized ranges,
Q>> S*T, Q>>N
S: # of nodes in the system
 Tokens are used to build function, mapping the hash space to
the ordered list of nodes.

70 Advanced Database system principle


Strategies adopted by Dynamo
 Strategy 3: Q/S tokens per node, equal-sized partitions:
 Similar to S2, the hash space is divided into Q equally sized
ranges,
 Each node is assigned Q/S tokens
 A node leaves system, its tokens are randomly distributed to the
remaining nodes such that the properties keeps.

71 Advanced Database system principle


End of Chapter

You might also like