
Maestría en Ciencia de Datos

Big Data

Week 4
Working with Data Models

Dr Sandra Ortega-Martorell

In this session…
• Streaming data: what it is and why it is different

• Data Lakes

• Exploring streaming data

• DBMS-based and non-DBMS-based approaches to Big Data

• Big Data Management Systems

• Retrieving Big Data: querying relational data with PostgreSQL

Streaming data: what it is and why it is different


Streaming data
• We mentioned previously that one of the Big Data challenges is the velocity of data, which arrives at varying rates.

• For some applications this creates the need to process the data as it is generated, or in other words, as it streams.

• We call these types of applications:

Streaming Data Processing Applications

• This terminology refers to a constant stream of data flowing from a source.


– E.g. data from a sensor or data from social media.

Streaming data – example application
• FlightStats, [Link]
– It processes ~60 million flight events per week through its data acquisition system, turning them into real-time intelligence for airlines and millions of travellers, daily.


Data stream – what is it?


• A possibly unbounded sequence of data records that may or may not be related to, or correlated with, each other.

• Each data record is generally timestamped and in some cases geo-tagged.

• Data can stream from many sources.
– E.g. instruments, many Internet of Things application areas, computer programs, websites, or social media posts.

• Streaming data is sometimes referred to as event data, as each data item is treated as an individual event in a synchronised sequence.

Data stream – challenges
• Conventional data management architectures are built primarily on the concept of persistent,
static data collections.

• Streams pose very difficult challenges for these conventional data management architectures
– as most often we have only one chance to look at, and process, streaming data before receiving
more.

• Streaming data management systems cannot be separated from real-time processing of data.
– Managing and processing data in motion is a typical capability of streaming data systems.


Streaming Data Systems – characteristics

• Designed to manage relatively simple computations on a data stream: one record at a time, or a small time window of data.
• Computations are done in near-real-time, sometimes in memory.
• Computations are independent of each other.
• The processing components often subscribe to a stream source non-interactively:
– this means they send nothing back to the source, nor establish interaction with the source.
• The sheer size, variety and velocity of big data adds further challenges.

Data stream – dynamic steering
• The concept of dynamic steering involves dynamically changing the next steps or direction of an application through a continuous computational process using streaming data.
– Dynamic steering is often a part of streaming data management and processing.

Example of a dynamic steering application: a self-driving car


Streaming Data Systems

• Examples of big data streaming systems include Apache Storm, Apache Flink, Apache Kafka, Spark Streaming, and Amazon Kinesis.

Why is Streaming Data different?

Data-at-rest:
• Mostly static data from one or more sources
• Collected prior to analysis
• Its analysis is called batch or static processing

Data-in-motion:
• Analysed as it is generated
– E.g. sensor data processing in a plane or a self-driving car
• Its analysis is called stream processing

Data processing algorithms

• Static / Batch Processing – size determines time and space:
– The run time and memory usage of most algorithms is usually dependent on the data size, which can easily be calculated from files or databases.

• Stream Processing – unbounded size, but finite time and space:
– The size of the data is unbounded, and this changes the types of algorithms that can be used.
– Algorithms that require iterating or looping over the whole data set are not possible, since with stream data you never get to the end.

Streaming Data Management and Processing
• Streaming Data Management and Processing should enable:

1. Computations on one data element, or a small window of data elements, at a time (see the SQL sketch after this list).
– These computations can update metrics, and monitor and plot statistics on the streaming data.

2. Relatively fast and simple computations.
– Since computations need to be completed in real time, the tasks processing streaming data should be quicker than, or at least not much slower than, the streaming rate of the data (the data velocity).

3. No interactions with the data source.
– In most streaming systems, the management and processing system subscribes to the data source, but does not send anything back to the stream source in terms of feedback or interactions.
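To make the first requirement concrete, a windowed computation can be sketched in SQL. This is a minimal illustration, not from the original slides, assuming a hypothetical events(ts, value) table into which a stream has been landed; it computes per-minute counts and averages, the batch analogue of a tumbling one-minute window:

-- Hypothetical table: events(ts TIMESTAMPTZ, value NUMERIC)
-- Per-minute tumbling-window statistics over the landed stream
SELECT date_trunc('minute', ts) AS window_start,
       COUNT(*)                 AS n_events,
       AVG(value)               AS avg_value
FROM events
GROUP BY date_trunc('minute', ts)
ORDER BY window_start;

A true streaming engine would maintain such aggregates incrementally as records arrive, rather than rescanning the table.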

Streaming Data
• These requirements for streaming data processing are quite different from those of batch processing.

• In batch processing, the analytical steps (often) have access to all the data, and can take more time to complete a complex analytical task, with less pressure on the completion time of individual data management and processing tasks.

• Most organisations today use a hybrid architecture for processing streaming and batch jobs at the same time, sometimes referred to as the lambda architecture.

Streaming Data – Lambda architecture
• Lambda architecture is a data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch- and stream-processing methods.

– Diagram: over time, each batch processing pass covers the accumulated historical data, while real-time processing covers the most recent data up to 'Now'.


Streaming data – challenges


• Scalability
– To accommodate rapid growth in traffic and data volume (scaling up)
– To adapt to decreases in demand (scaling down)

• Data replication and durability
– Data Availability vs. Data Durability: they are not the same thing.
– Data Availability refers to system uptime, i.e. the storage system is operational and can deliver data upon request.
– Data Durability refers to long-term data protection, i.e. the stored data does not suffer from bit rot, degradation or other corruption. It is concerned with data redundancy rather than hardware redundancy, so that data is never lost or compromised.
Streaming data – challenges
• These two main challenges mentioned before will need to be overcome to avoid data loss and to enable real-time analytical tasks.

The size and frequency of the stream data can significantly change over time.
Data changes may be periodic or sporadic.


Streaming data – changes in size and frequency


The size and frequency of the stream data can significantly change over time.

• Size – e.g. streaming data found on social networks can increase in volume during holidays, sports matches, or major news events.
• Frequency – changes may be periodic or sporadic.
Streaming data – periodic changes
Data changes may be periodic: evenings, weekends, etc.

Example: people may post messages on social media more in the evenings.


Streaming data – sporadic changes


Data changes may be sporadic: major events.

Examples:
• There can be an increase in data size and frequency during major events, sports matches, etc.
• Dropping or missing data when there are network problems.
Streaming data – extreme changes example
• Example of extreme data fluctuation:
– Average tweets per second: 6,000.
– During the first 10 years of Twitter, the record for most tweets per minute was set by Germany's victory over Argentina in the 2014 World Cup: there were 618,725 tweets in the 60 seconds after the final whistle was blown.


Data Lakes

Data Lakes
• With big data streaming from different sources in
varying formats, models, and speeds, we need to be
able to ingest this data into a fast and scalable
storage system that is flexible enough to serve many
current and future analytical processes.

• This is where traditional data warehouses, with their strict data models and data formats, do not fit the big data challenges of streaming and batch applications.

• The concept of a data lake was created in response to these big data storage and processing challenges.


What is a Data Lake?


• Simply speaking, a data lake is a part of a big data infrastructure that many streams can flow into, and where they get stored for processing in their original form.

• We can think of it as a massive storage depository with huge processing power and the ability to handle a very large number of concurrent data management and analytical tasks.

How do Data lakes work?
• The concept can be compared to a body of water, a lake, where water flows in, fills up a reservoir, and flows out.


How do Data lakes work? – in more detail


• In a Data Lake:
– The data gets loaded from its source and is stored in its native format until it is needed; at that point, applications can freely read the data and add structure to it.
– This is called ‘schema-on-read’.
– This approach ensures all data is stored for a potentially unknown use at a later time.

• In contrast, in a Data Warehouse:
– The data is loaded into the warehouse after transforming it into a well-defined and structured format.
– This is called ‘schema-on-write’.
– Any application using the data needs to know this format in order to retrieve and use the data.
– In this approach, data is not loaded into the warehouse unless there is a use for it.
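Schema-on-read can be illustrated in PostgreSQL terms using the JSONB type. This is a minimal sketch, not from the original slides, assuming a hypothetical raw_events landing table into which documents are loaded as-is; structure is imposed only at query time:

-- Hypothetical landing table: documents are stored as-is, no schema imposed on load
CREATE TABLE raw_events (doc JSONB);

-- Structure is imposed at read time: each query decides which fields to extract
SELECT doc->>'user'              AS username,
       (doc->>'ts')::timestamptz AS event_time
FROM raw_events
WHERE doc->>'type' = 'login';

A warehouse, by contrast, would require the user, ts and type columns to be defined and populated before loading (schema-on-write).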

Data warehouse vs. Data lake

Data warehouse:
• Stores data in a hierarchical file system with a well-defined structure.

Data lake:
• Stores data as flat files with a unique identifier.
• This often gets referred to as object storage in big data systems.

Data lake object storage

• In a data lake, each data item is stored as a Binary Large Object (BLOB) and is assigned a unique identifier.
• Each data object is tagged with a number of metadata tags.
• The data can be searched using these metadata tags to retrieve it.

• In Hadoop data architectures, data is loaded into HDFS and processed using the appropriate data management and
analytical systems on commodity clusters.
• The selection of the tools is based on the nature of the problem being solved, and the data format being accessed.
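As an illustration of tag-based retrieval, the following is a minimal SQL sketch, not from the original slides, assuming a hypothetical objects catalogue in which each BLOB identifier is stored alongside its metadata tags as JSONB:

-- Hypothetical catalogue: objects(object_id TEXT, tags JSONB)
-- Find all objects whose metadata tags include a given source
SELECT object_id
FROM objects
WHERE tags @> '{"source": "weather-station"}';

The @> containment operator matches every object whose tag set includes the given key/value pair, which is essentially how metadata-tag search behaves in object stores.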
Data lakes – summary
• A Big Data storage architecture

• Collects all data for current and future use/analysis

• Transforms data format only when needed (schema-on-read)

• Supports all types of Big data users

• Data lakes adapt easily to changes
– An infrastructure component that evolves over time, based on application-specific needs.


DBMS-based and non-DBMS-based approaches to Big Data

Storing data – Files vs. DBMS
• In the past, database operations were implemented by applications built directly on file systems.

• Problems with this approach:
– Data redundancy, inconsistency, and isolation: multiple file formats, often with duplication of the information, and large, complex files.
– Each task is a separate problem: there is no uniform way to access data. E.g. finding employees in a department sorted by their salary vs. finding employees in all departments sorted by their start date requires two separate programs.
– Data integrity: enforcement of constraints (also called integrity constraints). E.g. every employee has exactly one job title; changing this rule implies finding everywhere it was coded.
– Atomicity of updates (all of the changes must happen together, as a single unit): this has to do with system failures. E.g. if there is a failure while updating a series of records, it is not easy to fix the damage or even to start all over again.

Hence the need for the transition to a DBMS.



Advantages of a DBMS
Current DBMSs, especially relational DBMSs, have a number
of advantages:

1. Declarative query languages
– Declarative means that we state what we want to retrieve without saying how to retrieve it.
– E.g. we can say ‘find the average salary of employees in the IT division for every job title and sort from high to low’; we do not need to say how to extract or group these records (see the SQL sketch after this list).
– Hence, no more task-based programs.

2. Data independence
– Applications do not need to worry about data storage formats and locations.
– The goal of data independence is to isolate users from the record layout, so long as the logical definition of the data (tables and their attributes) is clearly specified.
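As an illustration, the salary query above can be written declaratively in SQL. This is a minimal sketch, not from the original slides, assuming a hypothetical employees(job_title, salary, division) table:

-- Hypothetical table: employees(job_title, salary, division)
SELECT job_title, AVG(salary) AS avg_salary
FROM employees
WHERE division = 'IT'
GROUP BY job_title
ORDER BY avg_salary DESC;

Note that the query states only what to compute; the grouping strategy, access paths and sort method are chosen by the query processor.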

Advantages of a DBMS
3. Effective access through optimisation
– The system automatically finds an efficient way to access data, even when there are a large number of tables and hundreds of millions of records.

4. Data integrity and security
– Methods to keep the accuracy and consistency of data despite failures:
• Transaction safety: the ACID properties of transactions.
– Atomicity: transactions are all or nothing.
– Consistency: only valid data is saved.
– Isolation: transactions do not affect each other.
– Durability: once written, data will not be lost.
• Failure recovery.

5. Concurrent access
– Many users can simultaneously access data without conflict (a transaction sketch follows below).
• E.g. in an airline reservation system with lots of people buying tickets at the same time, the DBMS must ensure that a ticket is not sold twice, and that if someone is in the middle of buying the last ticket, another person does not see that ticket as available.
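The ticket example can be sketched as a PostgreSQL transaction. This is a minimal illustration, not from the original slides, assuming a hypothetical tickets(ticket_id, flight_id, sold) table; SELECT ... FOR UPDATE locks the chosen row so a concurrent buyer cannot claim the same ticket:

BEGIN;

-- Lock one unsold ticket so no concurrent transaction can sell it at the same time
SELECT ticket_id
FROM tickets
WHERE flight_id = 1042 AND NOT sold
LIMIT 1
FOR UPDATE;

-- Mark it as sold (using the ticket_id returned above, here assumed to be 7)
UPDATE tickets SET sold = TRUE WHERE ticket_id = 7;

COMMIT;  -- atomic: the ticket is sold exactly once, or not at all

If two buyers race for the last ticket, the second transaction blocks on the row lock and, once the first commits, no longer sees the ticket as unsold.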

How do traditional databases handle large data volumes?

• Parallel Database System
– Improves performance through parallel implementation
• Operations like selection and join use parallel algorithms
– Often allows data replication
• Data redundancy guards against table corruption
• More concurrent queries

• Distributed Database System
– Data is stored across several sites, each site managed by a DBMS capable of running independently.
• In this case, one component knows some part of the schema of its neighbouring DBMS, and can pass a query, or part of a query, to the neighbour when needed.

DBMS-based approaches
• Typical question: if DBMSs are so powerful, why do we see the rise of MapReduce-style systems?

• Unfortunately, the answer is not straightforward.


DBMS and MapReduce-style Systems

• DBMSs: efficient storage, transactions and retrieval
– Partitioned data parallelism: different parts of a logical table can physically reside on different machines, so different parts of a query can access the partitions in parallel and speed up query performance.
– Optimised computation and communication cost (i.e. the time to exchange data between machines).
– However, they do not take machine failure into account.

• MapReduce (MR) was developed not for storage and retrieval, but for distributed processing of large amounts of data.

• MR-style systems: their goal was to support complex data processing over a cluster of machines.
– Since MR implementations run over HDFS, issues like node failure are automatically accounted for.
– They can be effectively used for data analytics: data mining, clustering, machine learning.
– Multi-stage, problem-specific algorithms are handled naturally (as opposed to in an RDBMS).
– They operate on a wider variety of data, including unstructured data like text.
Tension points in the data management world
The mixture of data management requirements and data processing and analysis requirements has created an interesting tension in the data management world. A few of these tension points are:

1. Data loading – a new bottleneck
– DBMSs perform storage and retrieval operations very efficiently, but first the data must be loaded into the DBMS. So, how long does loading take?
– Does the application need the data sooner than the loading time allows?

2. Too much functionality
– For very simple operations (e.g. reading records) involving a huge table, do we need all the guarantees offered by DBMSs?
– Does the application use only a few data management features?


Tension points in the data management world

3. Combination of transactional and analytical capabilities
– There is an emerging class of systems that meets all the transactional guarantees that a DBMS provides and, at the same time, supports efficient analytical operations.
– These are often required for systems like real-time decision support.

Mixed solutions
The combination of traditional requirements with new ones is leading to new capabilities and
products.
• DBMS-Hadoop interoperation
– DBMS technologies are creating new techniques that make use of MapReduce-style data processing. Many of
them run on HDFS.
– DBMSs are providing ways to perform a MapReduce-style operation on HDFS files and exchange data between
the Hadoop subsystem and the DBMS.
• Relational operations in MapReduce systems like Spark
– Simple map and reduce operations are not sufficient for many data operations.
– Spark has several kinds of join and data grouping operations in addition to map and reduce.
• Streaming input to DBMS
– Some DBMSs are making use of large distributed memory management operations to accept streaming data.
• New parallel programming models for analytical computation within the DBMS
– These algorithms use MR-style computing and are becoming part of a new generation of DBMS products that invoke them from inside the database system, e.g. finding dense regions in a graph.

Big Data Management Systems (BDMS)
[From DBMS to BDMS]

Desired characteristics of BDMS
• A flexible, semi-structured data model
– Support traditional applications, which require the development of a schema, as well as applications which require no schema, because the data can vary in terms of its attributes and relationships.

• Support for today’s common Big Data data types
– Textual, temporal, and spatial data values.

• A full query language
– To effectively manage large volumes of data, it is often more convenient to use a query language and let the query processor automatically determine optimal ways to retrieve the data.
– This query language may or may not look like SQL, but it should be at least equally powerful.

• An efficient parallel query engine running on multiple machines
– The machines can be connected in a shared-nothing architecture, a shared-memory architecture, or a shared cluster.
• Shared-nothing means the machines share neither disk nor memory.


Desired characteristics of BDMS


• Wide range of query sizes
– Some applications working on top of a BDMS will issue queries with only a few conditions and a few small objects to return.
– But some applications, especially those generated by other software tools or machine learning algorithms, can have many conditions and can return many large objects.

• Continuous data ingestion
– In many cases, a BDMS has both streaming and existing data that need to be combined to solve a problem.
– E.g. combining streaming data from weather stations with historical data to make better forecasts.

• Scale gracefully to manage and query large volumes of data
– Designed to operate over a cluster and to know how to handle a failure.
– The system should be able to handle new machines joining, or existing machines leaving, the cluster.

• Full data management capability
– Easy to install, restart and configure; provides high availability; and makes operational management as simple as possible.
ACID and BASE
• ACID properties (Atomicity, Consistency, Isolation, Durability) are hard to maintain in a BDMS:
– There is too much data, and there are too many updates from too many users.
– The effort to maintain the ACID properties may lead to a significant slowdown of the system.

• While ACID properties are still desirable, it might be more practical to relax them.

• BASE (Basically Available, Soft state, Eventually consistent) relaxes ACID.

CAP theorem
• A distributed computer system cannot simultaneously achieve:
– Consistency:
• Every read receives the most recent write or an error
– Availability:
• Every request receives a (non-error) response, without the guarantee that it contains the most recent
write
– Partition tolerance
• The system continues to operate despite an arbitrary number of messages being dropped (or delayed)
by the network between nodes

CAP theorem – pick two!

• Availability: remains accessible and operational at all times.
• Consistency: commits are atomic across the entire distributed system.
• Partition tolerance: only a total network failure can cause the system to respond incorrectly.

– CA (Consistency + Availability): traditional relational databases, e.g. PostgreSQL, MySQL.
– AP (Availability + Partition tolerance): Cassandra, CouchDB, Dynamo-like systems.
– CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable-like systems.

Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape
[Link]
Retrieving Big Data


Data retrieval and Query language

• Data retrieval is the way in which the desired data is specified and retrieved from a data store.

• A query language is a language that allows us to specify the data items we need.
– A query language is declarative: specify what you need rather than how to obtain it.
– E.g. SQL (Structured Query Language).

• In contrast to a query language, a database programming language, e.g. Oracle’s PL/SQL (Procedural Language/SQL), is a high-level procedural programming language that embeds query operations.

SQL
• Example
– A Beer Drinkers Club owns many bars, and each bar sells beer.
– Not every bar sells the same brands of beer, and even when they do, they may have different prices.
– The club keeps information about its regular member customers.
– It also knows which member visits which bars, and which beers each member likes.

• Database schema – the classic ‘beers’ schema:
– Beers(name, manf); Bars(name, addr, license); Drinkers(name, addr, phone); Sells(bar, beer, price); Likes(drinker, beer); Frequents(drinker, bar)

Select-from-where
• Which beers are made by Heineken?

SELECT name               -- output attribute(s)
FROM Beers                -- table(s) to use
WHERE manf = 'Heineken'   -- the condition(s) to satisfy

– Strings like ‘Heineken’ are case-sensitive and are put in quotes.
– In terms of data operations, this form of query corresponds to a selection followed by a projection.

More example queries
• Find expensive beers.

• Which businesses have a temporary license (starting with ‘32’) in San Diego?

(SQL sketches for both queries are given below.)
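A plausible SQL rendering of these two queries follows. The first assumes the Sells(bar, beer, price) table from the beers schema and an arbitrary price threshold; the second assumes a hypothetical Businesses(name, license, city) table, since this query refers to a different dataset:

-- Find expensive beers (price threshold chosen arbitrarily)
SELECT DISTINCT beer
FROM Sells
WHERE price > 20;

-- Businesses with a temporary license (starting with '32') in San Diego
-- (hypothetical Businesses(name, license, city) table)
SELECT name
FROM Businesses
WHERE license LIKE '32%' AND city = 'San Diego';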


Select-Project queries in large tables


• Large tables can be partitioned.

• There are many partitioning schemes.
– One of them is range partitioning on the primary key.
– E.g. the Beers table split into ranges of name (‘A…’, ‘B…’, and so on), with the partitions spread across Machine 1, Machine 2, …, Machine 5.
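In PostgreSQL, this kind of layout can be declared with native range partitioning. A minimal sketch, not from the original slides, partitioning the Beers table on name (the partition bounds are illustrative):

-- Parent table, range-partitioned on the name attribute
CREATE TABLE Beers (
    name TEXT NOT NULL,
    manf TEXT
) PARTITION BY RANGE (name);

-- Illustrative partitions; in a cluster, each could live on a different machine
CREATE TABLE beers_a_f PARTITION OF Beers FOR VALUES FROM ('A') TO ('G');
CREATE TABLE beers_g_m PARTITION OF Beers FOR VALUES FROM ('G') TO ('N');
CREATE TABLE beers_n_z PARTITION OF Beers FOR VALUES FROM ('N') TO (MAXVALUE);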

Select-Project queries in large tables

(The same partitioned Beers table, spread across Machines 1 to 5, as above.)

• Two queries:
– Find records for beers whose name starts with ‘Am’:
SELECT *
FROM Beers
WHERE name LIKE 'Am%'

– Which beers are made by Heineken?
SELECT name
FROM Beers
WHERE manf = 'Heineken'

Evaluating Select-Project queries for large data


• A query processing trick
– Use the partitioning information: names starting with ‘Am’ fall entirely within the first range, so just use partition 1.

SELECT *
FROM Beers
WHERE name LIKE 'Am%'

– Thus, so long as the system knows the partitioning strategy, it can make its job much more efficient.
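With the declarative partitioning sketched earlier, PostgreSQL applies this trick automatically via partition pruning, and EXPLAIN can confirm it. A minimal check, not from the original slides, with the prefix match written as an explicit range so the planner can prune regardless of collation:

-- Equivalent range form of the prefix match
EXPLAIN SELECT * FROM Beers WHERE name >= 'Am' AND name < 'An';
-- The resulting plan scans only the beers_a_f partition,
-- skipping beers_g_m and beers_n_z entirely.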

Evaluating Select-Project queries for large data
• Let’s look at the second query in the same partition setting
– The query condition is on the second attribute, manf, so the same trick cannot be applied.
– As it stands, this time we need to look in all partitions.
– It can be done in parallel.

SELECT name
FROM Beers
WHERE manf = ‘Heineken’


Local and global indexing


• What if I had 100 machines, and the desired data is only in 20 of them?
– Should we needlessly go through all 100 machines?
– Can this be avoided?

• Yes: to do this, we need one more piece in the solution, called an index structure.
– Very simply, an index can be thought of as a reverse table: given the value in a column, you get back the records where that value appears.
– Using an index speeds up query processing significantly.
– With indexes, we can solve this problem in different ways:
• Use a local index on each machine
• Use a machine index for each value
• Use a combined index in a global index server
– These will use more space, but queries will be faster.
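In PostgreSQL terms, a local index on the attribute the Heineken query filters on would be created as follows. A minimal sketch, not from the original slides; on a partitioned table, indexing the parent automatically creates an index on each partition:

-- Index on manf, the non-partition-key attribute used by the Heineken query
CREATE INDEX beers_manf_idx ON Beers (manf);

-- Each partition can now answer this with an index lookup instead of a full scan
SELECT name FROM Beers WHERE manf = 'Heineken';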


Conclusions


Summary
1. Summarised the key characteristics of a data stream and identified the requirements of
streaming data systems.
2. Described how Data Lakes enable batch processing of streaming data, and explained the
difference between ‘schema-on-write’ and ‘schema-on-read’.
3. Explained how data streams, data lakes, and data warehouses are organised on a spectrum of big data management and storage.
4. Explained the advantages of using a DBMS over a file system; specified the differences between a parallel and a distributed database system; and described MapReduce-style systems.
5. Explained the desirable characteristics of a Big Data Management System (BDMS), gave
examples of BDMSs, and described their similarities and differences.
