Getting to Know Cassandra
by Michelle Darling
August 2013
Agenda:
What is Cassandra?
Installation, CQL3
Data Modelling
Summary
Only 15 min to cover these, so please hold questions until the end, or email me :-) and I'll summarize Q&A for everyone.
Unfortunately, no time for:
DB Admin
- Detailed Architecture
- Partitioning / Consistent Hashing
- Consistency Tuning
- Data Distribution & Replication
- System Tables
App Development
- Using Python, Ruby, etc. to access Cassandra
- Using Hadoop to stream data into Cassandra
What is Cassandra?
NoSQL Distributed DB
Consistency - A__ID
Availability - High
Point of Failure - none
Good for Event Tracking & Analysis:
- Time series data
- Sensor device data
- Social media analytics
- Risk Analysis
- Failure Prediction
Rackspace: Which servers are under heavy load and are about to crash?
Cassandra: Fortuneteller of Doom from Greek Mythology. Tried to warn others about future disasters, but no one listened. Unfortunately, she was 100% accurate.
The Evolution of Cassandra
2005
Data Model: Wide rows, sparse arrays. High performance through very fast write throughput.
2006
Infrastructure: Peer-to-Peer Gossip, Key-Value Pairs, Tunable Consistency.
Originally for Inbox Search, but now used for Instagram.
2008: Open-Source Release / 2013: Enterprise & Community Editions
Other NoSQL vs. Cassandra
NoSQL Taxonomy:
- Key-Value Pairs: Dynamo, Riak, Redis
- Column-Based: BigTable, HBase, Cassandra
- Document-Based: MongoDB, Couchbase
- Graph: Neo4j
C* Differentiators:
- Production-proven at Netflix, eBay, Twitter, 20 of Fortune 100
- Clear Winner in Scalability, Performance, Availability
- Big Data Capable
-- DataStax
Architecture
Cluster (ring)
Nodes (circles)
Peer-to-Peer Model
Gossip Protocol
Partitioner: Consistent Hashing
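To make the ring idea concrete, here is a minimal Python sketch of consistent hashing. It is a toy model, not Cassandra's partitioner: the `Ring` class, the `token` helper, and the node names are all hypothetical, MD5 is chosen only for illustration, and real clusters add replication and many other details.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring: each node owns the arc that ends at its token."""

    def __init__(self, nodes):
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, row_key: str) -> str:
        # First node at or clockwise from the key's token, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, token(row_key)) % len(self._entries)
        return self._entries[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])

keys = [f"user-{i}" for i in range(1000)]
moved = [k for k in keys if ring.owner(k) != bigger.owner(k)]

# Keys that change owner all land on the new node; the existing nodes
# do not reshuffle data among themselves.
print(len(moved), "of", len(keys), "keys moved to node-d")
```

The point of the design is the last comment: growing the ring only relocates the keys that fall in the new node's arc.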
Netflix
Streaming Video
Personalized Recommendations per family member
Built on Amazon Web Services (AWS) + Cassandra
Cloud installation using Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Free for the 1st year! Then pay only for what you use.
Sign up for AWS EC2 account: Big Data University Video (4:34 minutes)
Amazon Machine Image (AMI): Preconfigured installation template
Choose: DataStax AMI for Cassandra Community Edition
Follow these *very good* step-by-step instructions from DataStax.
AMIs also available for Couchbase, MongoDB (make sure you pick the free tier community versions to avoid monthly charges!).
AWS EC2 Dashboard
DataStax AMI Setup
--clustername Michelle
--totalnodes 1
--version community
Roll Your Own Installation
DataStax Community Edition
Install instructions for Linux, Windows, MacOS:
http://www.datastax.com/2012/01/gettingstarted-with-cassandra
Video: Set up a 4-node Cassandra cluster in under 2 minutes
http://www.screenr.com/5G6
Invoke CQLSH, CREATE KEYSPACE
./bin/cqlsh
cqlsh> CREATE KEYSPACE big_data
   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE big_data;
cqlsh:big_data>
Tip: Skip Thrift -- use CQL3
CQL3
- Uses cqlsh
- SQL-like language
- Runs on top of Thrift RPC
- Much more user-friendly.
Thrift RPC
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;

// Your Column
Column col = new Column(ByteBuffer.wrap("name".getBytes()));
col.setValue(ByteBuffer.wrap("value".getBytes()));
col.setTimestamp(System.currentTimeMillis());
// Don't ask
ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
cosc.setColumn(col);
// Prepare to be amazed
Mutation mutation = new Mutation();
mutation.setColumnOrSuperColumn(cosc);
List<Mutation> mutations = new ArrayList<Mutation>();
mutations.add(mutation);
Map<ByteBuffer, Map<String, List<Mutation>>> mutations_map =
    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
Map<String, List<Mutation>> cf_map = new HashMap<String, List<Mutation>>();
cf_map.put("Standard1", mutations);
mutations_map.put(ByteBuffer.wrap("key".getBytes()), cf_map);
// cassandra is an already-connected Thrift client; consistency_level is a ConsistencyLevel
cassandra.batch_mutate(mutations_map, consistency_level);
The Thrift code above equals this in CQL3:
INSERT INTO Standard1 (id, name) VALUES ('key', 'value');
CREATE TABLE
cqlsh:big_data> create table user_tags (
    user_id varchar,
    tag varchar,
    value counter,
    primary key (user_id, tag)
);
TABLE user_tags: How many times has a user mentioned a hashtag?
COUNTER datatype - Computes & stores counter value at the time data is written. This optimizes query performance.
UPDATE TABLE
cqlsh:big_data> UPDATE user_tags SET value = value + 1 WHERE user_id = 'paul' AND tag = 'cassandra';
SELECT FROM TABLE
cqlsh:big_data> SELECT * FROM user_tags;
 user_id | tag       | value
---------+-----------+-------
    paul | cassandra |     1
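The counter behaviour shown above can be mimicked in a few lines of plain Python. This is only a sketch of the semantics (the count is maintained at write time, so reads are a simple lookup); the dictionary and the helper names `increment` and `select_all` are hypothetical stand-ins, not part of any Cassandra driver.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the user_tags table:
# primary key (user_id, tag) -> counter value.
user_tags = defaultdict(int)


def increment(user_id: str, tag: str, delta: int = 1) -> None:
    """Mimics: UPDATE user_tags SET value = value + delta WHERE user_id = ... AND tag = ..."""
    user_tags[(user_id, tag)] += delta


def select_all():
    """Mimics: SELECT * FROM user_tags (rows sorted by user_id, then tag)."""
    return [(user_id, tag, value) for (user_id, tag), value in sorted(user_tags.items())]


increment("paul", "cassandra")
rows = select_all()
print(rows)
```

Because the total is updated on every write, answering "how many times has paul mentioned cassandra?" never requires scanning individual mentions.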
DATA MODELING
A Major Paradigm Shift!
RDBMS | Cassandra
------+----------
Structured Data, Fixed Schema | Unstructured Data, Flexible Schema
Array of Arrays. 2D: ROW x COLUMN | Nested Key-Value Pairs. 3D: ROW Key x COLUMN key x COLUMN values
DATABASE | KEYSPACE
TABLE | TABLE a.k.a COLUMN FAMILY
ROW | ROW a.k.a PARTITION. Unit of replication.
COLUMN | COLUMN [Name, Value, Timestamp]. a.k.a CLUSTER. Unit of storage. Up to 2 billion columns per row.
FOREIGN KEYS, JOINS, ACID Consistency | Referential Integrity not enforced, so A_CID. BUT relationships represented using COLLECTIONS.
RDBMS: 2D: Rows x columns
Cassandra: 3D+: Nested Objects
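One way to picture the "nested key-value pairs" model is as dictionaries inside dictionaries. The sketch below is a simplification under stated assumptions: the `keyspace` dictionary, the `write` and `read` helpers, and the sample data are hypothetical, and the only real Cassandra behaviour it imitates is that each column carries a timestamp and the newest write wins.

```python
import time

# keyspace -> table (column family) -> row key -> column key -> (value, timestamp)
keyspace = {"users": {}}


def write(table: str, row_key: str, column: str, value, timestamp=None) -> None:
    """Store one column; rows need not share the same set of columns."""
    ts = time.time() if timestamp is None else timestamp
    row = keyspace[table].setdefault(row_key, {})
    # Last write wins: keep the column version with the newest timestamp.
    if column not in row or ts >= row[column][1]:
        row[column] = (value, ts)


def read(table: str, row_key: str, column: str):
    """Return the stored value, or None if this row has no such column."""
    cell = keyspace[table].get(row_key, {}).get(column)
    return None if cell is None else cell[0]


write("users", "paul", "name", "Paul", timestamp=1)
write("users", "paul", "city", "Austin", timestamp=1)
write("users", "mary", "name", "Mary", timestamp=1)    # this row has no "city" column
write("users", "paul", "city", "Boston", timestamp=2)  # newer timestamp replaces older

print(read("users", "paul", "city"))
print(read("users", "mary", "city"))
```

The rows for paul and mary hold different columns, which is the "flexible schema" side of the comparison: a missing column simply is not stored.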
Example: Twissandra Web App
Twitter-inspired sample application written in Python + Cassandra.
Play with the app: twissandra.com
Examine & learn from the code on GitHub.
Features/Queries:
- Sign In, Sign Up
- Post Tweet
- Userline (User's tweets)
- Timeline (All tweets)
- Following (Users being followed by user)
- Followers (Users following this user)
Twissandra.com vs Twitter.com
Twissandra - RDBMS Version
Entities
USER, TWEET
FOLLOWER, FOLLOWING
FRIENDS
Relationships:
USER has many TWEETs.
USER is a FOLLOWER of many USERs.
Many USERs are FOLLOWING USER.
Twissandra - Cassandra Version
Tip: Model tables to mirror queries.
TABLES or CFs:
- TWEET
- USER, USERNAME
- FOLLOWERS, FOLLOWING
- USERLINE, TIMELINE
Notes:
- Extra tables mirror queries.
- Denormalized tables are pre-formed for faster performance.
Tip: Remember, skip Thrift -- use CQL3
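The "tables mirror queries" tip can be sketched in plain Python: each write is copied into every table that a query will later read, so each read is a single lookup with no join. This is not Twissandra's actual code; the dictionaries, `post_tweet`, `get_userline`, and `get_timeline` are hypothetical names, and the timeline is modelled as "all tweets" as described in the feature list.

```python
from collections import defaultdict
from itertools import count

# Hypothetical in-memory stand-ins for three of the tables.
tweets = {}                     # tweet_id -> (username, body)
userline = defaultdict(list)    # username -> ids of that user's tweets
timeline = []                   # ids of all tweets, oldest first
_next_id = count(1)


def post_tweet(username: str, body: str) -> int:
    """Write the tweet once, then copy its id into each query table."""
    tweet_id = next(_next_id)
    tweets[tweet_id] = (username, body)
    userline[username].append(tweet_id)
    timeline.append(tweet_id)
    return tweet_id


def get_userline(username: str):
    """Bodies of one user's tweets, newest first: one lookup by key, no join."""
    return [tweets[t][1] for t in reversed(userline[username])]


def get_timeline():
    """(username, body) for all tweets, newest first."""
    return [tweets[t] for t in reversed(timeline)]


post_tweet("paul", "Trying out Cassandra")
post_tweet("mary", "Hello")
post_tweet("paul", "CQL3 is nice")
print(get_userline("paul"))
```

The cost of this design is extra writes per tweet, which fits a store whose strength is fast write throughput.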
TABLE
What does C* data look like?
TABLE Userline
List all of a user's Tweets
*************
Row Key: user_id
Columns
Column Key: tweet_id
at Timestamp
TTL (Time to Live): seconds until expiration date.
*************
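As a rough illustration of the Userline layout (row key, column key, timestamp, TTL), the following Python sketch stores each column with its write time and an optional TTL and hides columns whose TTL has elapsed. The `insert` and `live_columns` helpers and the sample values are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
# row key (user_id) -> column key (tweet_id) -> (value, written_at, ttl_seconds)
userline = {}


def insert(user_id, tweet_id, value, now, ttl=None):
    """Store one column; ttl=None means the column never expires."""
    userline.setdefault(user_id, {})[tweet_id] = (value, now, ttl)


def live_columns(user_id, now):
    """Return the columns of one row that have not expired at time `now`."""
    live = {}
    for tweet_id, (value, written_at, ttl) in userline.get(user_id, {}).items():
        if ttl is None or now < written_at + ttl:
            live[tweet_id] = value
    return live


insert("paul", 1, "expires in a minute", now=0, ttl=60)
insert("paul", 2, "kept forever", now=0)

print(live_columns("paul", now=30))
print(live_columns("paul", now=61))
```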
Cassandra Data Model = LEGOs?
Flexible Schema
Summary:
- Go straight from SQL to CQL3; skip Thrift, Column Families, SuperColumns, etc.
- Denormalize tables to mirror important queries. Roughly 1 table per important query.
- Choose wisely: Partition Keys, Cluster Keys, Indexes, TTL, Counters, Collections. See DataStax Music Service Example.
- Consider hybrid approach:
  20% - RDBMS for highly structured, OLTP, ACID requirements.
  80% - Scale Cassandra to handle the rest of data.
- Remember:
  Cheap: storage, servers, OpenSource software.
  Precious: User AND Developer Happiness.
Resources
C* Summit 2013 (slides):
- Cassandra at eBay Scale (slides)
- Data Modelers Still Have Jobs: Adjusting for the NoSQL Environment (slides)
- Real-time Analytics using Cassandra, Spark and Shark (slides)
- Cassandra By Example: Data Modelling with CQL3 (slides)
- DataStax C*ollege Credit: Data Modelling for Apache Cassandra (slides)
I wish I had found these first:
- How do I Cassandra? (slides)
- Mobile version of DataStax web docs (link)