
M314 - Big Data / R. Abid @UM6P-CC

CHAP-2
BIG DATA / NOSQL: WHAT? / WHERE?

1
WHAT IS BIG DATA?

• “In 2010 the term ‘Big Data’ was virtually unknown, but by mid-2011 it was being widely touted as the latest trend, with all the usual hype. Like ‘cloud computing’ before it, the term has today been adopted by everyone, from product vendors to large-scale outsourcing and cloud service providers keen to promote their offerings. But what really is Big Data?”

• Source: Fujitsu White Book on Big Data (available @Course Portal)

M314 - Big Data / R. Abid @UM6P-CC 2


WHAT IS BIG DATA?
• According to McKinsey:
• “Big Data refers to datasets whose size are beyond the ability of typical
database software tools to capture, store, manage and analyze”
• According to IDC:
• “Big Data is a new generation of technologies and architectures designed to extract value
economically from very large volumes of a wide variety of data by enabling high velocity
capture, discovery and analysis”
• According to O’Reilly:
• “Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or does not fit the structures of existing database architectures.
To gain value from these data, there must be an alternative way to process it”
☞ What are the Typical/Conventional DB software tools we are talking about here?

M314 - Big Data / R. Abid @UM6P-CC 3


THE ‘3VS’ OF BIG DATA

4
M314 - Big Data / R. Abid @UM6P-CC
I. VOLUME

Volume = Size

5
M314 - Big Data / R. Abid @UM6P-CC
I. VOLUME
CHALLENGES?
• Exponential Data Growth
• Scaling up/down

• Capturing
• Means?

• Storage
• Distributed Storage
• ☞ often with Replicas

• Retrieval / Access
• Time?

• Processing
• Means?
• Real-time or Batch Processing?
• Cost?

M314 - Big Data / R. Abid @UM6P-CC 6


II. VARIETY

• Variety = Complexity

• Data were confined only to Tables

• Today, Data are more heterogeneous


• Different sources, formats, etc.

M314 - Big Data / R. Abid @UM6P-CC 7


II. AXES OF DATA VARIETY

• Variety with one type – Think of an email Collection:


• Sender, receiver, date... - Well-structured
• Body of the email - Text
• Attachments - Multi-Media
• Who-sends-to-whom - Network
• A current email cannot reference a past email - Semantics
• Real-time - Availability

M314 - Big Data / R. Abid @UM6P-CC 8


II. IMPACTS OF DATA VARIETY
• Harder to ingest

• Difficult to create common storage

• Difficult to compare and match data across variety

• Difficult to integrate

M314 - Big Data / R. Abid @UM6P-CC 9


III. VELOCITY
• Velocity = Speed

• Speed in creating Data


• Speed in storing Data
• Speed in processing Data

• Speed ☞ Real-time Action


• SCADA, Health, Stock Exchange, etc.

• Late Action ☞ Missing Opportunity

10
M314 - Big Data / R. Abid @UM6P-CC
III. VELOCITY
REAL-TIME PROCESSING
VS. BATCH PROCESSING

• Streaming Data = “What’s going on right now?”

• Streaming Data + Real-Time Processing

• = ☞ Agile & Adaptable (Business) Decision-Making

M314 - Big Data / R. Abid @UM6P-CC 11


III. VELOCITY
SPEED OF DATA GENERATION AND/OR DATA
PROCESSING?
WHICH PATH TO CHOOSE?

12
M314 - Big Data / R. Abid @UM6P-CC
THE 4TH V - VERACITY

• Veracity = Quality (Validity, Volatility)

• Accuracy of data

• Reliability of the data source

• Context within Analysis

13
M314 - Big Data / R. Abid @UM6P-CC
IV. VERACITY (VALIDITY, VOLATILITY)
• Data veracity has given rise to two other big V’s of Big Data:

• Validity:
• means that the data is correct and accurate for the
intended use

• Volatility:
• refers to the rate of change and lifetime of the data

14
M314 - Big Data / R. Abid @UM6P-CC
V. THE VALUE - THE MOST IMPORTANT V OF ALL
• Value = (Actionable) Insight

• Just having Big Data is of no use unless we can turn it into value
• Extract useful Information/Knowledge ☞ Mostly of ‘Economic/Political Value’
• Companies, organizations, and states are starting to generate amazing value from their Big Data!

• How to get Value out of Big Data?

• ☞ DATA SCIENCE

15
M314 - Big Data / R. Abid @UM6P-CC
TOWARDS NOSQL - HISTORICAL REVIEW OF DATABASE STORAGE AND MGT MODELS
1. The first standard: Hierarchical Databases
   • Tree-like hierarchy
   • When querying, you navigate down the tree
   • Quite similar to current B-Tree
   • B-Tree index is the current default form for index storage in Relational Database Management Systems (RDBMS)
2. RDBMS and SQL
   • ACID Characteristics
3. Data in the Cloud
   • DBaaS
   • Storage as a Service (Amazon S3)
4. Big-Data?!

M314 - Big Data / R. Abid @UM6P-CC 16


2. RDBMS & SQL
WHY RELATIONAL DB IS NOT ALWAYS THE ANSWER?

• Relational Model
• Transaction-oriented
• Query Language (SQL)
• ACID (Reminder) - see the sketch below
  • Atomicity
    • All changes must be “All or Nothing” to allow for transactions
  • Consistency
    • All data must be consistent in all “Replicate” and “Retrieve” operations
  • Isolation
    • Any data being affected by one transaction must not be used by another transaction until the first is complete
  • Durability
    • Once a transaction is committed, it should become permanent and unalterable by machine failure of any type

• Reminder: RDBMS is but a software running on top of an OS!
  • ☞ Semaphores & Message Passing
• Now: What if our Data is not transactionally focused?
  • “… and therefore, sometimes, the Relational model is not the most appropriate for what we need to do with our data”
  • ☞ “The last statement may come as a shock to people for whom a database is a Relational database!”
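A minimal sketch of the “all or nothing” idea, using Python's built-in sqlite3 module (the accounts table and the amounts are made up for illustration, not taken from the slides):

import sqlite3

conn = sqlite3.connect("bank.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('A', 100.0), ('B', 0.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'B'")
        # if anything above fails, neither update survives -> Atomicity
except sqlite3.Error:
    print("transfer aborted, database left unchanged")

Once the with-block commits, the change is flushed to disk and survives a crash (Durability).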

M314 - Big Data / R. Abid @UM6P-CC 17


3. DATA IN THE CLOUD

• The Cloud has enabled Database management systems to become pay-as-you-go
• The Cloud provided organizations with the opportunity to store their data, and more importantly to store multiple copies of these data, thus avoiding a “single point of failure” and avoiding the need to have a “standby backup server”
  • Amazon S3

• Backup or Disaster Recovery?
  • A fact: Hard Drives go wrong!
  • In database systems, it is normal for the DBA to be responsible for ensuring no data is ever lost
  • Differences?
    • Backup: No data is permanently lost
      • Storing and tracking changes made over last periods, externally to the database
    • Recovery:
      • Replay the changes to the database since a point in time when the entire DS was consistent
    • DR:
      • When calamities happen, e.g., earthquakes, fires, floods, accidental deletion of tables
      • DR requires parallel systems (at different geographical sites)
      • Very costly
      • MTTR (Mean Time to Recovery): measures the speed by which a system can become available again
      • The Cloud is providing the Best Alternative

• ☞ How does this apply/relate to Big Data?

M314 - Big Data / R. Abid @UM6P-CC 18


4. BIG DATA
DRIVERS FOR THE ADOPTION OF A DIFFERENT DATA MODEL

• RDBMS have been the center of most business systems for decades
  • A fact ☞ “There is no real business without a Database”
• However, as web-driven systems began to expand, it became clear that RDBMS are not good at everything!
• Google eventually took the decision to write their own database for information searching
  • The story behind Google BigTable is well documented by Chang et al. - 2008
  • ☞ Paper (to discuss) available in Course Portal

• Google BigTable:
  • Not relational
  • Relational model could not provide rapid text-based searches across the vast volumes of web data - Why?
    • Large processing overhead for maintaining ACID properties over large data
    • RDBMSs potentially rely on the processor-hungry “joins”
  • Not the right tool for the task they had
    • Quickly finding relevant data from TBs of unstructured data (web content)
  • Although it is Google’s own product and not openly available, other such DBs exist: HBase and Cassandra claim a similar data model to that of BigTable

M314 - Big Data / R. Abid @UM6P-CC 19


MASTER USE CASE:
THE GOOGLE BIGTABLE DATA MODEL

• History:
  • “Bigtable development began in 2004 and is now used by a number of Google applications, such as web indexing, MapReduce, which is often used for generating and modifying data stored in Bigtable, Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail. Google's reasons for developing its own database include scalability and better control of performance characteristics”

• Keywords from the assigned paper:
  • “Bigtable is a distributed storage system that is designed to scale to a very large size”
  • “The Bigtable clusters … ”
  • “Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings”
  • 2. Data Model: “A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
    • Hint ☞ about “what is a map”?
      • Associative Arrays (Google it?)
      • Associative Mapping (TLB/Caching PTEs in OS)
      • See the toy sketch below
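A toy in-memory sketch of that “sparse, multi-dimensional sorted map” (an illustration in plain Python, not Google's implementation; the row and column names follow the web-page example from the Bigtable paper):

# map[(row_key, column_key, timestamp)] -> uninterpreted bytes
table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value                     # cells are just raw bytes

def get_latest(row, column):
    # return the cell with the highest timestamp for this (row, column), if any
    cells = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(cells)[1] if cells else None

put("com.cnn.www", "contents:", b"<html>...</html>", ts=1)
put("com.cnn.www", "anchor:cnnsi.com", b"CNN", ts=2)
print(get_latest("com.cnn.www", "contents:"))            # b'<html>...</html>'

The map is “sparse” because most (row, column) combinations simply never appear as keys.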

M314 - Big Data / R. Abid @UM6P-CC 20


MASTER USE CASE:
THE GOOGLE BIGTABLE DATA MODEL - CONT.

• “The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.”
• “Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.”
  ☞ MapReduce is to be covered during this class.
• “Bigtable uses the distributed Google File System (GFS) to store log and data files”.
  ☞ GFS will be covered during this class.
• “Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range”
• “The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, …”
• “Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened.“
  • Hint ☞ iNode / File Systems / OS
• “Bigtable relies on a highly-available and persistent distributed lock service called Chubby“
  • Hint ☞ Synchronization / Mutual Exclusion / Message Passing (OS)

M314 - Big Data / R. Abid @UM6P-CC 21


4. BIG DATA
DRIVERS FOR THE ADOPTION OF A DIFFERENT DATA MODEL - CONT.

• “Data Availability” as a driver
  • Every DBA wants their DB to be 99.99% available
  • But this sort of availability costs both time and money
    • Replications, fault tolerance
• “ACID” as a driver
  • Most RDBMS guarantee that all values in all nodes are identical before a user is allowed to read the values. Still, this is at a significant cost.
  • Is this what we want with Web data, for instance?
  • ☞ NoSQL

• CAP Theorem (next slide) / Brewer (UC Berkeley)
  • Brewer (2012): 2 letters out of 3
  • ☞ Which Letter to drop for Distributed RDBMS?
  • ☞ Which Letter to drop for NoSQL (Big Data)?
  • The hint: Any Distributed System is prone to node-failure, thus “Partitioning” needs to be tolerated! CAP concerns P.
  • As a Consequence ☞ There are only 2 combinations!!! See next Question
• ☞ THE QUESTION is:
  • When a network Partition failure happens:
    • Should you cancel the operation (send an error msg!) and thus decrease Availability but ensure Consistency?
    • OR
    • Proceed with the operation and thus provide Availability but risk inconsistency?
  • Answer in next slide

M314 - Big Data / R. Abid @UM6P-CC 22


CAP THEOREM

• Answer to previous Question
  • for Distributed RDBMS (ACID): ☞ Usually drop A!
  • for NoSQL (BASE): ☞ Usually drop C!
M314 - Big Data / R. Abid @UM6P-CC 23
- BASE -
THE NOSQL / BIG DATA PROPERTY

• ☞ BASE is the NoSQL operating premise (like ACID for RDBs)
• BASE = Basically Available, Soft-State, Eventually Consistent
• “You will note that we move from a world of data consistency to a world where all we are promised is that all copies of the data will, at some point, be the same”!

• The “Read-repair” approach (see the sketch below)
  • The read operation will return the first value found
  • Any stale nodes discovered are marked for updating at a later stage
  • ☞ Some “RDBMS People” will find it hard to handle the last sentence
  • In some applications, however, it is not critical for every user to have identical data all the time
  • Examples
    • A Discussion Board Application / e.g., Facebook
    • Web Search / Indexing
    • Simple, low-level data access: lookup, filtering (selection)
      • Access implemented as a simple API provided by a programming-language library or via a communication protocol (HTTP/REST, XML-RPC, SOAP, etc.)
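A rough sketch of the read-repair idea (the node layout, version field, and repair_queue are assumptions for illustration, not the API of any particular NoSQL product):

repair_queue = []   # stale (node, key) pairs to be fixed in the background

def read_with_repair(key, replicas):
    values = [(node, node[key]) for node in replicas]    # ask every replica
    answer = values[0][1]                                # return the first value found ...
    newest = max(v["version"] for _, v in values)
    for node, v in values:
        if v["version"] < newest:                        # ... but remember who is stale
            repair_queue.append((node, key))             # repaired later, off the read path
    return answer["value"]

n1 = {"cart:42": {"version": 3, "value": ["book", "pen"]}}
n2 = {"cart:42": {"version": 2, "value": ["book"]}}      # stale replica
print(read_with_repair("cart:42", [n1, n2]))             # -> ['book', 'pen'], n2 queued for repair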

M314 - Big Data / R. Abid @UM6P-CC 24


NOSQL DB
TYPES & TIMELINE

• 4 Types:
  1. K-V store
  2. Document-based
  3. Column-based
  4. Graph DBs
  - Vector DBs? 5th?
    • ☞ “… are a new and emerging category, and their classification is still a topic of discussion, but they are most often considered an extension of, or a new distinct type within, the NoSQL family.”

• NoSQL Timeline:
  • 2006 → Google Bigtable (column store)
    • Relevant paper available @course Portal
  • 2007 → Amazon Dynamo (key-value)
    • Relevant paper available @course Portal
  • 2008 → Facebook Cassandra (column store)
  • 2009 → MongoDB (document store)
  • 2010s → Graph DBs (Neo4j)
  • 2020s → Vector DBs
    • though, started as a technology decades ago (with vector representations of DNA and geographical data)
    • FAISS (Facebook AI Similarity Search) - 2017
    • 2022+ (with ChatGPT / GenAI) - Chroma, Pinecone

M314 - Big Data / R. Abid @UM6P-CC 25


1. KEY-VALUE STORES

• Concept:
  • Data stored as simple key-value pairs (like a dictionary or hash map)
  • usually BLOBs
  • No Schema (unstructured)
• Typical DB Systems:
  • Redis, Amazon DynamoDB, Riak, Aerospike
• Use cases:
  • Session cache, real-time data, mostly unstructured data
• Typical companies / use cases:
  • Amazon (shopping carts, session data), Netflix (caching), PayPal (user sessions)
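A minimal sketch of the key-value model using redis-py, the Python client for Redis listed above (the key name and payload are hypothetical; it assumes a Redis server on the default local port):

import redis

r = redis.Redis(host="localhost", port=6379)

# the whole "schema" is just key -> opaque value (often a BLOB / serialized object)
r.set("session:user:1001", '{"cart": ["sku-42"], "lang": "fr"}')
r.expire("session:user:1001", 1800)       # session data is typically short-lived

print(r.get("session:user:1001"))         # b'{"cart": ["sku-42"], "lang": "fr"}'

The store neither knows nor cares what is inside the value; structure, if any, lives in the application.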

M314 - Big Data / R. Abid @UM6P-CC 26


2. DOCUMENT STORES

• Concept:
  • Store semi-structured data (JSON/BSON/XML)
  • each record is a document
• Typical DB Systems:
  • MongoDB, CouchDB, Firebase, Couchbase
• Use cases:
  • CMS (e.g., News), catalogs, logs
• Typical companies / use cases:
  • Meta (user profiles), eBay (product catalogs), Forbes (content management)
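A minimal sketch with pymongo, the Python driver for MongoDB listed above (the 'catalog' database and the product fields are made up; it assumes a MongoDB server on localhost):

from pymongo import MongoClient

db = MongoClient()["catalog"]

# documents in one collection may have different shapes (semi-structured)
db.products.insert_one({"sku": "42", "name": "Solar panel", "watts": 400})
db.products.insert_one({"sku": "43", "name": "Inverter", "specs": {"phases": 3}})

print(db.products.find_one({"sku": "43"}))

Note that the two documents do not share the same fields, and no schema had to be declared beforehand.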

M314 - Big Data / R. Abid @UM6P-CC 27


3. COLUMN-FAMILY STORES

• Concept:
  • Store data by columns instead of rows
  • optimized for wide-table analytics
• Typical DB Systems:
  • Apache Cassandra, HBase, Google Bigtable
• Use cases:
  • IoT, time-series, analytics
• Typical companies / use cases:
  • Instagram (user timelines), Uber (geospatial data), Spotify (metrics)
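A minimal sketch of the time-series use case with the DataStax Python driver for Apache Cassandra (the 'iot' keyspace, table, and sensor values are hypothetical; a Cassandra node is assumed to be running locally):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("iot")   # keyspace 'iot' assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text, ts timestamp, temperature double,
        PRIMARY KEY (sensor_id, ts))   -- partitioned by sensor, clustered (ordered) by time
""")
session.execute(
    "INSERT INTO readings (sensor_id, ts, temperature) VALUES (%s, toTimestamp(now()), %s)",
    ("s-17", 21.4))

rows = session.execute("SELECT * FROM readings WHERE sensor_id = %s LIMIT 10", ("s-17",))

Because all readings of one sensor live in one wide partition ordered by time, range scans over a sensor's recent history stay cheap.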

M314 - Big Data / R. Abid @UM6P-CC 28


4. GRAPH DATABASES

• Concept:
  • Represent entities as nodes & relationships as edges
  • ideal for network/relationship data
• Typical DB Systems:
  • Neo4j, ArangoDB, Amazon Neptune, JanusGraph
• Use cases:
  • Recommendations, fraud detection
• Typical companies / use cases:
  • LinkedIn (connections), Facebook (social graph), Airbus (supply-chain paths)
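A minimal sketch with the official Neo4j Python driver (the bolt URI, credentials, and the Person/KNOWS data are hypothetical):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as s:
    s.run("CREATE (:Person {name:'Amina'})-[:KNOWS]->(:Person {name:'Yassine'})"
          "-[:KNOWS]->(:Person {name:'Sara'})")
    # friends-of-friends: follow exactly two KNOWS edges from Amina
    result = s.run("MATCH (:Person {name:'Amina'})-[:KNOWS*2]->(fof) RETURN fof.name")
    print([record["fof.name"] for record in result])     # -> ['Sara']

The traversal follows stored edges directly, which is why relationship-heavy queries (recommendations, fraud rings) avoid the join costs of a relational design.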

M314 - Big Data / R. Abid @UM6P-CC 29


NOSQL EXAMPLES

• DynamoDB (Amazon)
• Redis (Twitter)
• Riak (AT&T, GitHub)
• BigTable (Google)
• MongoDB (CISCO, BOSCH, HSBC, ..)
• Cassandra (Facebook)
• Voldemort (LinkedIn)
• HBase (Alibaba)
• PNUTS (Yahoo)
• BerkeleyDB (Oracle)
• HBase (Apache)
  • Open-source implementation of Google BigTable
• Pinecone (OpenAI, Cisco)
• Chroma (LangChain int.)
• …. etc

M314 - Big Data / R. Abid @UM6P-CC


30
THE 5TH? VECTOR DB - NEW TYPE OF NOSQL DB?

☞ The AI-Native Data Store
• “Vector Databases store embeddings - high-dimensional numeric vectors representing text, images, audio, or code - to enable semantic search and similarity queries”
• “Vector databases represent the convergence of Big Data and AI - turning knowledge into searchable numerical space”
• Exception/footprint ☞ Unlike traditional DBs that excel at exact match (SQL) or key-based lookups (Key-Value NoSQL), a Vector DB's main function is similarity search (see last slide for Extra).

• Why they Matter?
  • Bridge Data Analytics and AI inference.
  • Core to Retrieval-Augmented Generation (RAG) systems.
  • Enable meaning-based search instead of keyword matching.
  • Power next-generation LLM and GenAI applications.

• Now, the question is: Can you introduce them as the fifth major Type of NoSQL Database?
  • What do you think?
  • a sub-set of K-V!?
  • … now, some NoSQL DBs (e.g., MongoDB) added extensions to support Vectors!

M314 - Big Data / R. Abid @UM6P-CC 31


VECTOR DB - SIMILARITY SEARCH

• In classical databases, we query for exact matches:
  • SELECT * FROM users WHERE city = ‘Benguerir’;
• but in AI contexts - image, text, or audio embeddings - we want to find semantically similar items, not identical ones.
• Similarity search ☞ “Find items that are most similar in meaning, not in literal value.”
• So instead of keyword matching, we use numerical similarity between vectors (lists of numbers representing meanings).
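A minimal sketch of that shift, with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); cosine similarity scores how close two meanings are, regardless of exact wording:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

renewable = np.array([0.12, -0.33, 0.85])   # "Renewable energy in Morocco"
solar     = np.array([0.11, -0.31, 0.87])   # "Solar power near Ouarzazate"
football  = np.array([0.90,  0.40, -0.10])  # "Football results"

print(cosine_similarity(renewable, solar))     # close to 1.0 -> similar meaning
print(cosine_similarity(renewable, football))  # much lower   -> different meaning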

M314 - Big Data / R. Abid @UM6P-CC 32


VECTOR DB - SIMILARITY SEARCH
HOW IT WORKS?

• Step 1: Convert data into vectors
  • Each item (text, image, document) is converted into a vector embedding using an AI model:
    • Sentence: “Renewable energy in Morocco” → [0.12, -0.33, 0.85, …]
    • Another: “Solar power near Ouarzazate” → [0.11, -0.31, 0.87, …]
• Step 2: Store embeddings in a Vector Database
  • Each vector is stored with its ID and metadata:
    • ID: doc_001
    • Vector: [0.12, -0.33, 0.85, ...]
    • Metadata: {title: "Morocco Solar Energy"}
• Step 3: Query with a vector
  • When you ask: “green energy projects in North Africa”, the query is also converted into a vector, and the database searches for vectors close to it (i.e., similar) - not identical strings.
  • Vector DBs use mathematical distance metrics to quantify closeness, e.g.,
    • Cosine similarity: angle between vectors (ignores magnitude) - text embeddings
    • Euclidean distance: straight-line distance in space - images, 3D data
    • Dot product: magnitude + direction
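Putting the three steps together, a toy end-to-end sketch (the embedding step is only indicated in a comment, since it depends on whichever model you use; the vectors and metadata reuse the illustrative values above):

import numpy as np

store = []   # each entry: (doc_id, vector, metadata)

def add(doc_id, vector, metadata):                       # Step 2: store the embedding
    store.append((doc_id, np.asarray(vector), metadata))

def search(query_vector, k=3):                           # Step 3: query with a vector
    q = np.asarray(query_vector)
    scored = []
    for doc_id, v, meta in store:
        sim = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))   # cosine similarity
        scored.append((sim, doc_id, meta))
    return sorted(scored, reverse=True)[:k]              # closest meanings first

# Step 1 (not shown): vector = embedding_model.encode("Renewable energy in Morocco")
add("doc_001", [0.12, -0.33, 0.85], {"title": "Morocco Solar Energy"})
add("doc_002", [0.90, 0.40, -0.10], {"title": "Football results"})
print(search([0.11, -0.31, 0.87], k=1))                  # -> doc_001, the semantically closest item

A production Vector DB does the same thing conceptually, but with approximate nearest-neighbour indexes so the search stays fast over millions of vectors.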

33
M314 - Big Data / R. Abid @UM6P-CC

VECTOR DB - SIMILARITY SEARCH
MAIN APPLICATIONS

Domain                  | Example
LLMs / RAG              | Retrieve semantically relevant context chunks for GPT/LLM responses.
Search Engines          | Replace keyword search with meaning-based document retrieval.
Recommendation Systems  | Suggest similar products, songs, or movies based on vector similarity.
Computer Vision         | Find visually similar images or objects.
Voice / Audio           | Match similar sounds or speakers.

34
THESE DAYS’ TREND?
M314 - Big Data / R. Abid @UM6P-CC

MODERN BIG DATA ARCHITECTURES
• The integration of Vector Databases
(Vector DBs) with Big Data platforms is
driven by the need to combine large-
scale data storage with AI-powered
semantic search
• The most common and robust solutions
today involve integrating the Vector DB
as a specialized component of a Data
Lakehouse architecture (e.g., Databricks,
Snowflake-Cortex)
☞ to be covered in a later chapter

35
