
M314 - Big Data / R. Abid @UM6P-CC

CHAP-2
BIG DATA / NOSQL: WHAT? / WHERE?

1
WHAT IS BIG DATA?

• “In 2010 the term ‘Big Data’ was virtually unknown, but by mid-2011 it was being widely touted as the latest trend, with all the usual hype. Like ‘cloud computing’ before it, the term has today been adopted by everyone, from product vendors to large-scale outsourcing and cloud service providers keen to promote their offerings. But what really is Big Data?”

• Source: Fujitsu White Book on Big Data (available @Course Portal)

M314 - Big Data / R. Abid @UM6P-CC 2


WHAT IS BIG DATA?
• According to McKinsey:
• “Big Data refers to datasets whose size are beyond the ability of typical
database software tools to capture, store, manage and analyze”
• According to IDC:
• “Big Data is a new generation of technologies and architectures designed to extract value
economically from very large volumes of a wide variety of data by enabling high velocity
capture, discovery and analysis”
• According to O’Reilly:
• “Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or does not fit the structures of existing database architectures.
To gain value from these data, there must be an alternative way to process it”
☞ What are the Typical/Conventional DB software tools we are talking about here?

M314 - Big Data / R. Abid @UM6P-CC 3


THE ‘3VS’ OF BIG DATA

4
M314 - Big Data / R. Abid @UM6P-CC
I. VOLUME

Volume = Size

5
M314 - Big Data / R. Abid @UM6P-CC
I. VOLUME
CHALLENGES?
• Exponential Data Growth
• Scaling up/down

• Capturing
• Means?

• Storage
• Distributed Storage
• ☞ often with Replicas

• Retrieval / Access
• Time?

• Processing
• Means?
• Real-time or Batch Processing?
• Cost?

M314 - Big Data / R. Abid @UM6P-CC 6


II. VARIETY

• Variety = Complexity

• Data were confined only to Tables

• Today, Data are more heterogeneous


• Different sources, formats, etc.

M314 - Big Data / R. Abid @UM6P-CC 7


II. AXES OF DATA VARIETY

• Variety with one type – Think of an email Collection:


• Sender, receiver, date... - Well-structured
• Body of the email - Text
• Attachments - Multi-Media
• Who-sends-to-whom - Network
• A current email cannot reference a past email - Semantics
• Real-time - Availability

M314 - Big Data / R. Abid @UM6P-CC 8


II. IMPACTS OF DATA VARIETY
• Harder to ingest

• Difficult to create common storage

• Difficult to compare and match data across variety

• Difficult to integrate

M314 - Big Data / R. Abid @UM6P-CC 9


III. VELOCITY
• Velocity = Speed

• Speed in creating Data


• Speed in storing Data
• Speed in processing Data

• Speed ☞ Real-time Action


• SCADA, Health, Stock Exchange, etc.

• Late Action ☞ Missing Opportunity

10
M314 - Big Data / R. Abid @UM6P-CC
III. VELOCITY
REAL-TIME PROCESSING
VS. BATCH PROCESSING

• Streaming Data = “What’s going on right now?”

• Streaming Data + Real-Time Processing

• = ☞ Agile & Adaptable (Business) Decision-Making

M314 - Big Data / R. Abid @UM6P-CC 11


III. VELOCITY
SPEED OF DATA GENERATION AND/OR DATA
PROCESSING?
WHICH PATH TO CHOOSE?

12
M314 - Big Data / R. Abid @UM6P-CC
THE 4TH V - VERACITY

• Veracity = Quality (Validity, Volatility)

• Accuracy of data

• Reliability of the data source

• Context within Analysis

13
M314 - Big Data / R. Abid @UM6P-CC
IV. VERACITY (VALIDITY, VOLATILITY)
• Data veracity has given rise to two other big V’s of Big Data:

• Validity:
• means that the data is correct and accurate for the
intended use

• Volatility:
• refers to the rate of change and lifetime of the data

14
M314 - Big Data / R. Abid @UM6P-CC
V. THE VALUE - THE MOST IMPORTANT V OF ALL
• Value = (Actionable) Insight

• Just having Big Data is of no use unless we can turn it into value
• Extract useful Information/Knowledge ☞ Mostly of ‘Economic/Political Value’
• Companies, organizations, and states are starting to generate amazing value from their Big Data!

• How to get Value out of Big Data?

• ☞ DATA SCIENCE

15
M314 - Big Data / R. Abid @UM6P-CC
TOWARDS NOSQL - HISTORICAL REVIEW OF DATABASE STORAGE AND MGT MODELS
1. The first standard: Hierarchical Databases
   • Tree-like hierarchy
   • When querying, you navigate down the tree
   • Quite similar to current B-Tree
   • B-Tree index is the current default form for index storage in Relational Database Management Systems (RDBMS)
2. RDBMS and SQL
   • ACID Characteristics
3. Data in the Cloud
   • DBaaS
   • Storage as a Service (Amazon S3)
4. Big-Data?!

M314 - Big Data / R. Abid @UM6P-CC 16


2. RDBMS & SQL
WHY RELATIONAL DB IS NOT ALWAYS THE ANSWER?

• Relational Model
• Transaction-oriented
• Query Language (SQL)
• ACID (Reminder) - see the sketch below
  • Atomicity
    • All changes must be “All or Nothing” to allow for transactions
  • Consistency
    • All data must be consistent in all “Replicate” and “Retrieve” operations
  • Isolation
    • Any data being affected by one transaction must not be used by another transaction until the first is complete
  • Durability
    • Once a transaction is committed, it should become permanent and unalterable by machine failure of any type

• Reminder: RDBMS is but a software running on top of an OS!
  • ☞ Semaphores & Message Passing
• Now: What if our Data is not transactionally focused?
  • “… and therefore, sometimes, the Relational model is not the most appropriate for what we need to do with our data”
  • ☞ “The last statement may come as a shock to people for whom a database is a Relational database!”
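A minimal sketch of the “all or nothing” idea, using Python's built-in sqlite3 module (the accounts table and the amounts are made up for illustration, not taken from the slides):

import sqlite3

conn = sqlite3.connect("bank.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('A', 100.0), ('B', 0.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'B'")
        # if anything above fails, neither update survives -> Atomicity
except sqlite3.Error:
    print("transfer aborted, database left unchanged")

Once the with-block commits, the change is flushed to disk and survives a crash (Durability).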

M314 - Big Data / R. Abid @UM6P-CC 17


3. DATA IN THE CLOUD

• The Cloud has enabled Database management systems to become pay-as-you-go
• The Cloud provided organizations with the opportunity to store their data, and more importantly to store multiple copies of these data, thus avoiding a “single point of failure” and avoiding the need to have a “standby backup server”
  • Amazon S3

• Backup or Disaster Recovery?
  • A fact: Hard Drives go wrong!
  • In database systems, it is normal for the DBA to be responsible for ensuring no data is ever lost
  • Differences?
    • Backup: No data is permanently lost
      • Storing and tracking changes made over last periods, externally to the database
    • Recovery:
      • Replay the changes to the database since a point in time when the entire DS was consistent
    • DR:
      • When calamities happen, e.g., earthquakes, fires, floods, accidental deletion of tables
      • DR requires parallel systems (at different geographical sites)
      • Very costly
      • MTTR (Mean Time to Recovery): measures the speed by which a system can become available again
      • The Cloud is providing the Best Alternative

• ☞ How does this apply/relate to Big Data?

M314 - Big Data / R. Abid @UM6P-CC 18


4. BIG DATA
DRIVERS FOR THE ADOPTION OF A DIFFERENT DATA MODEL

• RDBMS have been the center of most business systems for decades
  • A fact ☞ “There is no real business without a Database”
• However, as web-driven systems began to expand, it became clear that RDBMS are not good at everything!
• Google eventually took the decision to write their own database for information searching
  • The story behind Google BigTable is well documented by Chang et al. - 2008
  • ☞ Paper (to discuss) available in Course Portal

• Google BigTable:
  • Not relational
  • Relational model could not provide rapid text-based searches across the vast volumes of web data - Why?
    • Large processing overhead for maintaining ACID properties over large data
    • RDBMSs potentially rely on the processor-hungry “joins”
  • Not the right tool for the task they had
    • Quickly finding relevant data from TBs of unstructured data (web content)
  • Although it is Google’s own product and not openly available, other such DBs exist: HBase and Cassandra claim a similar data model to that of BigTable

M314 - Big Data / R. Abid @UM6P-CC 19


MASTER USE CASE:
THE GOOGLE BIGTABLE DATA MODEL

• History:
  • “Bigtable development began in 2004 and is now used by a number of Google applications, such as web indexing, MapReduce, which is often used for generating and modifying data stored in Bigtable, Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail. Google's reasons for developing its own database include scalability and better control of performance characteristics”

• Keywords from the assigned paper:
  • “Bigtable is a distributed storage system that is designed to scale to a very large size”
  • “The Bigtable clusters … ”
  • “Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings”
  • 2. Data Model: “A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
    • Hint ☞ about “what is a map”?
      • Associative Arrays (Google it?)
      • Associative Mapping (TLB/Caching PTEs in OS)
      • See the toy sketch below
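A toy in-memory sketch of that “sparse, multi-dimensional sorted map” (an illustration in plain Python, not Google's implementation; the row and column names follow the web-page example from the Bigtable paper):

# map[(row_key, column_key, timestamp)] -> uninterpreted bytes
table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value                     # cells are just raw bytes

def get_latest(row, column):
    # return the cell with the highest timestamp for this (row, column), if any
    cells = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(cells)[1] if cells else None

put("com.cnn.www", "contents:", b"<html>...</html>", ts=1)
put("com.cnn.www", "anchor:cnnsi.com", b"CNN", ts=2)
print(get_latest("com.cnn.www", "contents:"))            # b'<html>...</html>'

The map is “sparse” because most (row, column) combinations simply never appear as keys.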

M314 - Big Data / R. Abid @UM6P-CC 20


MASTER USE CASE:
THE GOOGLE BIGTABLE DATA MODEL - CONT.

• “The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.”
• “Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.”
  ☞ MapReduce is to be covered during this class.
• “Bigtable uses the distributed Google File System (GFS) to store log and data files”.
  ☞ GFS will be covered during this class.
• “Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range”
• “The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, …”
• “Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened.“
  • Hint ☞ iNode / File Systems / OS
• “Bigtable relies on a highly-available and persistent distributed lock service called Chubby“
  • Hint ☞ Synchronization / Mutual Exclusion / Message Passing (OS)

M314 - Big Data / R. Abid @UM6P-CC 21


4. BIG DATA
DRIVERS FOR THE ADOPTION OF A DIFFERENT DATA MODEL - CONT.

• “Data Availability” as a driver
  • Every DBA wants their DB to be 99.99% available
  • But this sort of availability costs both time and money
    • Replications, fault tolerance
• “ACID” as a driver
  • Most RDBMS guarantee that all values in all nodes are identical before a user is allowed to read the values. Still, this is at a significant cost.
  • Is this what we want with Web data, for instance?
  • ☞ NoSQL

• CAP Theorem (next slide) / Brewer (UC Berkeley)
  • Brewer (2012): 2 letters out of 3
  • ☞ Which Letter to drop for Distributed RDBMS?
  • ☞ Which Letter to drop for NoSQL (Big Data)?
  • The hint: Any Distributed System is prone to node-failure, thus “Partitioning” needs to be tolerated! CAP concerns P.
  • As a Consequence ☞ There are only 2 combinations!!! See next Question
• ☞ THE QUESTION is:
  • When a network Partition failure happens:
    • Should you cancel the operation (send an error msg!) and thus decrease Availability but ensure Consistency?
    • OR
    • Proceed with the operation and thus provide Availability but risk inconsistency?
  • Answer in next slide

M314 - Big Data / R. Abid @UM6P-CC 22


CAP THEOREM

• Answer to previous Question
  • for Distributed RDBMS (ACID): ☞ Usually drop A!
  • for NoSQL (BASE): ☞ Usually drop C!
M314 - Big Data / R. Abid @UM6P-CC 23
- BASE -
THE NOSQL / BIG DATA PROPERTY

• ☞ BASE is the NoSQL operating premise (like ACID for RDBs)
• BASE = Basically Available, Soft-State, Eventually Consistent
• “You will note that we move from a world of data consistency to a world where all we are promised is that all copies of the data will, at some point, be the same”!

• The “Read-repair” approach (see the sketch below)
  • The read operation will return the first value found
  • Any stale nodes discovered are marked for updating at a later stage
  • ☞ Some “RDBMS People” will find it hard to handle the last sentence
  • In some applications, however, it is not critical for every user to have identical data all the time
  • Examples
    • A Discussion Board Application / e.g., Facebook
    • Web Search / Indexing
    • Simple, low-level data access: lookup, filtering (selection)
      • Access implemented as a simple API provided by a programming-language library or via a communication protocol (HTTP/REST, XML-RPC, SOAP, etc.)
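A rough sketch of the read-repair idea (the node layout, version field, and repair_queue are assumptions for illustration, not the API of any particular NoSQL product):

repair_queue = []   # stale (node, key) pairs to be fixed in the background

def read_with_repair(key, replicas):
    values = [(node, node[key]) for node in replicas]    # ask every replica
    answer = values[0][1]                                # return the first value found ...
    newest = max(v["version"] for _, v in values)
    for node, v in values:
        if v["version"] < newest:                        # ... but remember who is stale
            repair_queue.append((node, key))             # repaired later, off the read path
    return answer["value"]

n1 = {"cart:42": {"version": 3, "value": ["book", "pen"]}}
n2 = {"cart:42": {"version": 2, "value": ["book"]}}      # stale replica
print(read_with_repair("cart:42", [n1, n2]))             # -> ['book', 'pen'], n2 queued for repair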

M314 - Big Data / R. Abid @UM6P-CC 24


NOSQL DB
TYPES & TIMELINE

• 4 Types:
  1. K-V store
  2. Document-based
  3. Column-based
  4. Graph DBs
  - Vector DBs? 5th?
    • ☞ “… are a new and emerging category, and their classification is still a topic of discussion, but they are most often considered an extension of, or a new distinct type within, the NoSQL family.”

• NoSQL Timeline:
  • 2006 → Google Bigtable (column store)
    • Relevant paper available @course Portal
  • 2007 → Amazon Dynamo (key-value)
    • Relevant paper available @course Portal
  • 2008 → Facebook Cassandra (column store)
  • 2009 → MongoDB (document store)
  • 2010s → Graph DBs (Neo4j)
  • 2020s → Vector DBs
    • though, started as a technology decades ago (with vector representations of DNA and geographical data)
    • FAISS (Facebook AI Similarity Search) - 2017
    • 2022+ (with ChatGPT / GenAI) - Chroma, Pinecone

M314 - Big Data / R. Abid @UM6P-CC 25


1. KEY-VALUE STORES

• Concept:
  • Data stored as simple key-value pairs (like a dictionary or hash map)
  • usually BLOBs
  • No Schema (unstructured)
• Typical DB Systems:
  • Redis, Amazon DynamoDB, Riak, Aerospike
• Use cases:
  • Session cache, real-time data, mostly unstructured data
• Typical companies / use cases:
  • Amazon (shopping carts, session data), Netflix (caching), PayPal (user sessions)
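A minimal sketch of the key-value model using redis-py, the Python client for Redis listed above (the key name and payload are hypothetical; it assumes a Redis server on the default local port):

import redis

r = redis.Redis(host="localhost", port=6379)

# the whole "schema" is just key -> opaque value (often a BLOB / serialized object)
r.set("session:user:1001", '{"cart": ["sku-42"], "lang": "fr"}')
r.expire("session:user:1001", 1800)       # session data is typically short-lived

print(r.get("session:user:1001"))         # b'{"cart": ["sku-42"], "lang": "fr"}'

The store neither knows nor cares what is inside the value; structure, if any, lives in the application.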

M314 - Big Data / R. Abid @UM6P-CC 26


2. DOCUMENT STORES

• Concept:
  • Store semi-structured data (JSON/BSON/XML)
  • each record is a document
• Typical DB Systems:
  • MongoDB, CouchDB, Firebase, Couchbase
• Use cases:
  • CMS (e.g., News), catalogs, logs
• Typical companies / use cases:
  • Meta (user profiles), eBay (product catalogs), Forbes (content management)
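A minimal sketch with pymongo, the Python driver for MongoDB listed above (the 'catalog' database and the product fields are made up; it assumes a MongoDB server on localhost):

from pymongo import MongoClient

db = MongoClient()["catalog"]

# documents in one collection may have different shapes (semi-structured)
db.products.insert_one({"sku": "42", "name": "Solar panel", "watts": 400})
db.products.insert_one({"sku": "43", "name": "Inverter", "specs": {"phases": 3}})

print(db.products.find_one({"sku": "43"}))

Note that the two documents do not share the same fields, and no schema had to be declared beforehand.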

M314 - Big Data / R. Abid @UM6P-CC 27


3. COLUMN-FAMILY STORES

• Concept:
  • Store data by columns instead of rows
  • optimized for wide-table analytics
• Typical DB Systems:
  • Apache Cassandra, HBase, Google Bigtable
• Use cases:
  • IoT, time-series, analytics
• Typical companies / use cases:
  • Instagram (user timelines), Uber (geospatial data), Spotify (metrics)
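A minimal sketch of the time-series use case with the DataStax Python driver for Apache Cassandra (the 'iot' keyspace, table, and sensor values are hypothetical; a Cassandra node is assumed to be running locally):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("iot")   # keyspace 'iot' assumed to exist

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text, ts timestamp, temperature double,
        PRIMARY KEY (sensor_id, ts))   -- partitioned by sensor, clustered (ordered) by time
""")
session.execute(
    "INSERT INTO readings (sensor_id, ts, temperature) VALUES (%s, toTimestamp(now()), %s)",
    ("s-17", 21.4))

rows = session.execute("SELECT * FROM readings WHERE sensor_id = %s LIMIT 10", ("s-17",))

Because all readings of one sensor live in one wide partition ordered by time, range scans over a sensor's recent history stay cheap.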

M314 - Big Data / R. Abid @UM6P-CC 28


4. GRAPH DATABASES

• Concept:
  • Represent entities as nodes & relationships as edges
  • ideal for network/relationship data
• Typical DB Systems:
  • Neo4j, ArangoDB, Amazon Neptune, JanusGraph
• Use cases:
  • Recommendations, fraud detection
• Typical companies / use cases:
  • LinkedIn (connections), Facebook (social graph), Airbus (supply-chain paths)
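A minimal sketch with the official Neo4j Python driver (the bolt URI, credentials, and the Person/KNOWS data are hypothetical):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as s:
    s.run("CREATE (:Person {name:'Amina'})-[:KNOWS]->(:Person {name:'Yassine'})"
          "-[:KNOWS]->(:Person {name:'Sara'})")
    # friends-of-friends: follow exactly two KNOWS edges from Amina
    result = s.run("MATCH (:Person {name:'Amina'})-[:KNOWS*2]->(fof) RETURN fof.name")
    print([record["fof.name"] for record in result])     # -> ['Sara']

The traversal follows stored edges directly, which is why relationship-heavy queries (recommendations, fraud rings) avoid the join costs of a relational design.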

M314 - Big Data / R. Abid @UM6P-CC 29


NOSQL EXAMPLES

• DynamoDB (Amazon)
• Redis (Twitter)
• Riak (AT&T, GitHub)
• BigTable (Google)
• MongoDB (CISCO, BOSCH, HSBC, ..)
• Cassandra (Facebook)
• Voldemort (LinkedIn)
• HBase (Alibaba)
• PNUTS (Yahoo)
• BerkeleyDB (Oracle)
• HBase (Apache)
  • Open-source implementation of Google BigTable
• Pinecone (OpenAI, Cisco)
• Chroma (LangChain int.)
• …. etc

M314 - Big Data / R. Abid @UM6P-CC


30
THE 5TH? VECTOR DB - NEW TYPE OF NOSQL DB?

☞ The AI-Native Data Store
• “Vector Databases store embeddings - high-dimensional numeric vectors representing text, images, audio, or code - to enable semantic search and similarity queries”
• “Vector databases represent the convergence of Big Data and AI - turning knowledge into searchable numerical space”
• Exception/footprint ☞ Unlike traditional DBs that excel at exact match (SQL) or key-based lookups (Key-Value NoSQL), a Vector DB's main function is similarity search (see last slide for Extra).

• Why they Matter?
  • Bridge Data Analytics and AI inference.
  • Core to Retrieval-Augmented Generation (RAG) systems.
  • Enable meaning-based search instead of keyword matching.
  • Power next-generation LLM and GenAI applications.

• Now, the question is: Can you introduce them as the fifth major Type of NoSQL Database?
  • What do you think?
  • a sub-set of K-V!?
  • … now, some NoSQL DBs (e.g., MongoDB) added extensions to support Vectors!

M314 - Big Data / R. Abid @UM6P-CC 31


VECTOR DB - SIMILARITY SEARCH

• In classical databases, we query for exact matches:
  • SELECT * FROM users WHERE city = ‘Benguerir’;
• but in AI contexts - image, text, or audio embeddings - we want to find semantically similar items, not identical ones.
• Similarity search ☞ “Find items that are most similar in meaning, not in literal value.”
• So instead of keyword matching, we use numerical similarity between vectors (lists of numbers representing meanings).
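A minimal sketch of that shift, with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); cosine similarity scores how close two meanings are, regardless of exact wording:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

renewable = np.array([0.12, -0.33, 0.85])   # "Renewable energy in Morocco"
solar     = np.array([0.11, -0.31, 0.87])   # "Solar power near Ouarzazate"
football  = np.array([0.90,  0.40, -0.10])  # "Football results"

print(cosine_similarity(renewable, solar))     # close to 1.0 -> similar meaning
print(cosine_similarity(renewable, football))  # much lower   -> different meaning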

M314 - Big Data / R. Abid @UM6P-CC 32


VECTOR DB - SIMILARITY SEARCH
HOW IT WORKS?

• Step 1: Convert data into vectors
  • Each item (text, image, document) is converted into a vector embedding using an AI model:
    • Sentence: “Renewable energy in Morocco” → [0.12, -0.33, 0.85, …]
    • Another: “Solar power near Ouarzazate” → [0.11, -0.31, 0.87, …]
• Step 2: Store embeddings in a Vector Database
  • Each vector is stored with its ID and metadata:
    • ID: doc_001
    • Vector: [0.12, -0.33, 0.85, ...]
    • Metadata: {title: "Morocco Solar Energy"}
• Step 3: Query with a vector
  • When you ask: “green energy projects in North Africa”, the query is also converted into a vector, and the database searches for vectors close to it (i.e., similar) - not identical strings.
  • Vector DBs use mathematical distance metrics to quantify closeness, e.g.,
    • Cosine similarity: angle between vectors (ignores magnitude) - text embeddings
    • Euclidean distance: straight-line distance in space - images, 3D data
    • Dot product: magnitude + direction
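Putting the three steps together, a toy end-to-end sketch (the embedding step is only indicated in a comment, since it depends on whichever model you use; the vectors and metadata reuse the illustrative values above):

import numpy as np

store = []   # each entry: (doc_id, vector, metadata)

def add(doc_id, vector, metadata):                       # Step 2: store the embedding
    store.append((doc_id, np.asarray(vector), metadata))

def search(query_vector, k=3):                           # Step 3: query with a vector
    q = np.asarray(query_vector)
    scored = []
    for doc_id, v, meta in store:
        sim = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))   # cosine similarity
        scored.append((sim, doc_id, meta))
    return sorted(scored, reverse=True)[:k]              # closest meanings first

# Step 1 (not shown): vector = embedding_model.encode("Renewable energy in Morocco")
add("doc_001", [0.12, -0.33, 0.85], {"title": "Morocco Solar Energy"})
add("doc_002", [0.90, 0.40, -0.10], {"title": "Football results"})
print(search([0.11, -0.31, 0.87], k=1))                  # -> doc_001, the semantically closest item

A production Vector DB does the same thing conceptually, but with approximate nearest-neighbour indexes so the search stays fast over millions of vectors.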

33
M314 - Big Data / R. Abid @UM6P-CC

VECTOR DB - SIMILARITY SEARCH
MAIN APPLICATIONS

Domain                  | Example
LLMs / RAG              | Retrieve semantically relevant context chunks for GPT/LLM responses.
Search Engines          | Replace keyword search with meaning-based document retrieval.
Recommendation Systems  | Suggest similar products, songs, or movies based on vector similarity.
Computer Vision         | Find visually similar images or objects.
Voice / Audio           | Match similar sounds or speakers.

34
THESE DAYS’ TREND?
M314 - Big Data / R. Abid @UM6P-CC

MODERN BIG DATA ARCHITECTURES
• The integration of Vector Databases
(Vector DBs) with Big Data platforms is
driven by the need to combine large-
scale data storage with AI-powered
semantic search
• The most common and robust solutions
today involve integrating the Vector DB
as a specialized component of a Data
Lakehouse architecture (e.g., Databricks,
Snowflake-Cortex)
☞ to be covered in a later chapter

35
