Lecture - Week04
Big Data
Week 4
Working with Data Models
Dr Sandra Ortega-Martorell
In this session…
• Streaming data: what it is and why it is different
• Data Lakes
S Ortega-Martorell
Streaming data: what it is and why it is different
Streaming data
• We mentioned previously that one of the Big Data challenges
is the velocity of data, which arrives at varying rates.
Streaming data – example application
• FlightStats, [Link]
– It processes ~60 million weekly flight events that come into its data acquisition system, and turns them into real-time
intelligence for airlines and millions of travellers, daily.
Data stream – challenges
• Conventional data management architectures are built primarily on the concept of persistent,
static data collections.
• Streams pose very difficult challenges for these conventional data management architectures
– as most often we have only one chance to look at, and process, streaming data before receiving
more.
• Streaming data management systems cannot be separated from real-time processing of data.
– Managing and processing data in motion is a typical capability of streaming data systems.
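The "one chance to look at the data" constraint can be illustrated with a single-pass computation. The sketch below is only an illustration, not part of the lecture: it keeps a running mean and variance using Welford's method, so each value is used once and never stored. The `sensor_stream` generator and its readings are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class RunningStats:
    """Mean and variance maintained in a single pass (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the values seen so far.
        return self.m2 / self.count if self.count else 0.0


def sensor_stream() -> Iterator[float]:
    """Hypothetical source; a generator can only be consumed once."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # each value is seen once and then discarded
```

The state kept between readings is three numbers, regardless of how long the stream runs, which is the property a batch computation over a stored collection does not need.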
Data stream – dynamic steering
• The concept of dynamic steering involves dynamically changing the next steps or direction
of an application through a continuous computational process using streaming.
– Dynamic steering is often a part of streaming data management and processing.
Why is Streaming Data different?
Data-at-rest
• Mostly static data from one or more sources
• Collected prior to analysis
Data-in-motion
• Analysed as it is generated
– E.g. sensor data processing in a plane or a self-driving car
Streaming Data Management and Processing
• Streaming Data Management and Processing should enable:
Streaming Data
• These requirements for streaming data processing are quite different from those of batch processing.
• In batch processing, the analytical steps have access to (often) all data and can take more
time to complete a complex analytical task with less pressure on the completion time of
individual data management and processing tasks.
• Most organisations today use a hybrid architecture for processing streaming and batch jobs
at the same time, which is sometimes referred to as the lambda architecture.
Streaming Data – Lambda architecture
• Lambda architecture is a data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch- and stream-processing methods.
[Figure: timeline along a Time axis up to "Now", with repeated Batch and Real-time segments]
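A minimal sketch of how batch and stream processing can be combined, assuming the usual lambda-architecture split into a batch layer, a speed layer, and a serving step that merges the two. The event shape, the class and function names, and the sample keys are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (key, amount); a hypothetical event shape


def build_batch_view(master_dataset: Iterable[Event]) -> Counter:
    """Batch layer: recompute totals from all historical events."""
    view = Counter()
    for key, amount in master_dataset:
        view[key] += amount
    return view


class SpeedLayer:
    """Speed layer: incrementally covers events since the last batch run."""

    def __init__(self) -> None:
        self.realtime_view = Counter()

    def ingest(self, event: Event) -> None:
        key, amount = event
        self.realtime_view[key] += amount


def query(key: str, batch_view: Counter, speed: SpeedLayer) -> int:
    """Serving step: merge the batch view with the real-time view."""
    return batch_view[key] + speed.realtime_view[key]


historical = [("LHR", 3), ("JFK", 2), ("LHR", 1)]
batch_view = build_batch_view(historical)

speed = SpeedLayer()
speed.ingest(("LHR", 5))  # arrived after the batch view was built
```

Here `query("LHR", batch_view, speed)` combines the slower, complete batch result with the recent events that the batch run has not yet seen.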
Data Availability vs. Durability
They are not the same thing
Data Availability
– Refers to system uptime, i.e. the storage system is operational and
can deliver data upon request.
Data Durability
– Refers to long-term data protection
• i.e. the stored data does not suffer from bit rot, degradation or other
corruption.
– It is concerned with data redundancy rather than hardware
redundancy, so that data is never lost or compromised.
Streaming data – challenges
• The two main challenges mentioned before need to be overcome to avoid data loss
and to enable real-time analytical tasks:
– The size and frequency of the stream data can change significantly over time.
– Data changes may be periodic or sporadic.
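One way to observe changes in stream frequency is to count events in a sliding time window. The sketch below is an illustration under assumptions: the window length, the baseline, and the "3 times baseline" burst rule are arbitrary choices made for the example, not values from the lecture.

```python
from collections import deque


class SlidingWindowRate:
    """Counts events whose timestamps fall within the last `window` seconds."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.timestamps = deque()

    def record(self, t: float) -> int:
        """Register an event at time t and return the count in the window."""
        self.timestamps.append(t)
        # Drop events that have fallen out of the window (t - window, t].
        while self.timestamps and self.timestamps[0] <= t - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)


def is_burst(count: int, baseline: float, factor: float = 3.0) -> bool:
    """Flag a burst when the windowed count exceeds factor x baseline."""
    return count > factor * baseline


rate = SlidingWindowRate(window=10.0)
quiet = [rate.record(t) for t in (0.0, 5.0, 10.0, 15.0)]
spike = [rate.record(t) for t in (16.0, 16.1, 16.2, 16.3, 16.4, 16.5)]
```

A system that sees `is_burst(...)` become true could, for example, add capacity or start sampling; which reaction is appropriate depends on the application.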
Size
– Example: streaming data found on social networks can increase in
volume during holidays, sports matches, or major news events.
Frequency
– Periodic
– Sporadic
Streaming data – periodic changes
Data changes may be periodic
– Periodic: evenings, weekends, etc.
– Example: people may post more messages on social media in the evenings.
Sporadic: major
events.
Examples:
Streaming data – extreme changes example
• Example of extreme data fluctuation:
Average Tweets / Second: 6,000
During the first 10 years of Twitter, the record for most tweets per minute was set during
Germany's victory over Argentina at the 2014 World Cup.
Data Lakes
Data Lakes
• With big data streaming from different sources in
varying formats, models, and speeds, we need to be
able to ingest this data into a fast and scalable
storage system that is flexible enough to serve many
current and future analytical processes.
• We can think of it as a massive storage repository with huge processing power and the ability to
handle a very large number of concurrent data management and analytical tasks.
How do Data lakes work?
• The concept can be compared to a body of water, a lake, where water flows in, fills up a reservoir, and
flows out.
Data warehouse vs. Data lake
Object storage
Data lake
• Each data item is stored as a Binary Large
Object (BLOB) and is assigned a unique
identifier.
• Each data object is tagged with a
number of metadata tags.
• The data can be searched using these
metadata tags to retrieve it.
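The three points above (BLOB storage, unique identifiers, metadata tags) can be sketched as a toy in-memory object store. This is an illustrative assumption of how such a store might look, not the API of any real product; the class name, method names, and tag names are hypothetical.

```python
import uuid
from typing import Dict, List


class ObjectStore:
    """Toy in-memory object store: BLOBs keyed by unique ID, found via tags."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._tags: Dict[str, Dict[str, str]] = {}

    def put(self, blob: bytes, **tags: str) -> str:
        object_id = str(uuid.uuid4())  # unique identifier for the object
        self._blobs[object_id] = blob
        self._tags[object_id] = dict(tags)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._blobs[object_id]

    def search(self, **wanted: str) -> List[str]:
        """Return IDs of objects whose metadata matches every given tag."""
        return [
            oid for oid, tags in self._tags.items()
            if all(tags.get(k) == v for k, v in wanted.items())
        ]


store = ObjectStore()
log_id = store.put(b"2024-01-01 INFO started", kind="log", source="web")
img_id = store.put(b"\x89PNG...", kind="image", source="web")
```

Note that the store never inspects the bytes themselves; objects of any format are accepted, and only the tags are used for retrieval.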
• In Hadoop data architectures, data is loaded into HDFS and processed using the appropriate data management and
analytical systems on commodity clusters.
• The selection of the tools is based on the nature of the problem being solved, and the data format being accessed.
Data lakes – summary
• A Big Data storage architecture
Storing data – Files vs. DBMS
• In the past, database operations were implemented as applications on top of file systems.
Advantages of a DBMS
Current DBMSs, especially relational DBMSs, have a number
of advantages:
2. Data independence
– Applications do not worry about data storage formats and locations.
– The goal of data independence is to isolate the users from the record layout so long as the logical
definition of the data (tables and their attributes) is clearly specified.
Advantages of a DBMS
3. Effective access through optimisation
– The system automatically finds an efficient way to access data
• Even when there are a large number of tables and hundreds of millions of records.
5. Concurrent access
– Many users can simultaneously access data without conflict
• E.g. in an airline reservation system with lots of people buying tickets at the same time, the DBMS must
ensure that a ticket is not sold twice. Or, if someone is in the middle of buying the last ticket, another
person does not see that ticket as available.
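The "ticket is not sold twice" requirement can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration using one connection: the check ("is it unsold?") and the claim ("mark it sold") are done in a single conditional `UPDATE`, so a second buyer cannot slip in between them. The table layout and the names `tickets`, `seat`, and `buyer` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (seat TEXT PRIMARY KEY, buyer TEXT)")
conn.execute("INSERT INTO tickets (seat, buyer) VALUES ('12A', NULL)")
conn.commit()


def buy(conn: sqlite3.Connection, seat: str, buyer: str) -> bool:
    """Claim the seat only if it is still unsold; return True on success."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE tickets SET buyer = ? WHERE seat = ? AND buyer IS NULL",
            (buyer, seat),
        )
        return cur.rowcount == 1  # 0 rows changed: someone got there first


first = buy(conn, "12A", "alice")
second = buy(conn, "12A", "bob")
owner = conn.execute(
    "SELECT buyer FROM tickets WHERE seat = '12A'"
).fetchone()[0]
```

An application written directly against files would have to implement this check-and-claim logic, plus the locking behind it, by itself.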
DBMS-based approaches
• Typical question:
• MapReduce (MR)-style systems: their goal was to support complex data processing over a cluster of machines
– Since MR implementations are over HDFS, issues like node failure are automatically accounted for.
– They can be effectively used for data analytics: data mining, clustering, machine learning.
– Multi-stage, problem-specific algorithms are handled naturally (as opposed to in an RDBMS).
– They operate on a wider variety of data, including unstructured data like text.
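To make the MR style concrete, here is a single-process sketch of the map, shuffle, and reduce phases applied to unstructured text (a word count). It only imitates the programming model; a real MR system would run the phases on many machines over HDFS. The function names and sample text are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key."""
    return key, sum(values)


splits = ["big data big ideas", "data in motion"]
emitted = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can be distributed across machines, which is what makes the model suitable for clusters.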
Tension points in the data management world
The mixture of data management requirements and data processing and analysis
requirements has created an interesting tension in the data management world.
Mixed solutions
The combination of traditional requirements with new ones is leading to new capabilities and
products.
• DBMS-Hadoop interoperation
– DBMS technologies are creating new techniques that make use of MapReduce-style data processing. Many of
them run on HDFS.
– DBMSs are providing ways to perform a MapReduce-style operation on HDFS files and exchange data between
the Hadoop subsystem and the DBMS.
• Relational operations in MapReduce systems like Spark
– Simple map and reduce operations are not sufficient for many data operations.
– Spark has several kinds of join and data grouping operations in addition to map and reduce.
• Streaming input to DBMS
– Some DBMSs are making use of large distributed memory management operations to accept streaming data.
• New parallel programming models for analytical computation within DBMS
– These algorithms use MR-style computing and are becoming a part of a new generation of DBMS products
that invoke these algorithms from inside the database system. E.g. finding dense regions in a graph.
Desired characteristics of BDMS
• A flexible, semi-structured, data model
– Support traditional applications, which require the development of a schema, and also support applications
that require no schema because the data can vary in terms of its attributes and relationships.
ACID and BASE
• ACID properties are hard to maintain in a BDMS:
– There is too much data and too many updates from too many users.
– The effort to maintain ACID properties may lead to a significant slowdown of the system
• ACID:
– Atomicity: Transactions are all or nothing
– Consistency: Only valid data is saved
– Isolation: Transactions do not affect each other
– Durability: When written, data will not be lost
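A small sketch of atomicity and consistency in practice, using Python's built-in `sqlite3` module as a stand-in for a transactional DBMS. The `accounts` table, its `CHECK` constraint, and the `transfer` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money in one transaction; both updates apply or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 60)        # valid transfer
rejected = transfer(conn, "a", "b", 60)  # would overdraw "a", so it is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit to "b" succeeds first and is then rolled back when the debit violates the constraint, so no partial result is ever saved. Providing this guarantee across many machines and many simultaneous users is what becomes expensive in a BDMS.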
CAP theorem
• A distributed computer system cannot simultaneously guarantee all three of:
– Consistency:
• Every read receives the most recent write or an error
– Availability:
• Every request receives a (non-error) response, without the guarantee that it contains the most recent
write
– Partition tolerance
• The system continues to operate despite an arbitrary number of messages being dropped (or delayed)
by the network between nodes
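The trade-off can be sketched with a toy two-replica system. This is a deliberately simplified model, not a real replication protocol: during a network partition, a "CP" configuration answers with an error rather than risk stale data, while an "AP" configuration always answers but may return an old value. All class and function names are hypothetical.

```python
class Replica:
    def __init__(self) -> None:
        self.value = "v1"


class Cluster:
    """Two replicas; `partitioned` means they cannot exchange messages."""

    def __init__(self, mode: str) -> None:
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value: str) -> None:
        """Writes arrive at replica A and replicate to B when possible."""
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value

    def read_from_b(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise RuntimeError("unavailable: cannot confirm latest write")
        return self.b.value  # AP: always answers, possibly with stale data


def demo(mode: str) -> str:
    """Partition the cluster, attempt a write, then read from replica B."""
    cluster = Cluster(mode)
    cluster.partitioned = True
    try:
        cluster.write("v2")
        return cluster.read_from_b()
    except RuntimeError:
        return "error"
```

Under a partition, `demo("AP")` returns the stale value and `demo("CP")` returns an error, which matches the definitions of availability and consistency given above.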
CAP theorem
[Figure: CAP diagram. Availability: remains accessible and operational at all times. Visible pair labels: CA, AP. Caption: "Pick two!"]
Retrieving Big Data
SQL
• Example
– A Beer Drinkers Club owns many bars, and each bar sells beer.
– Not every bar sells the same brands of beer, and even when they do, they may have different prices.
– It keeps information about the regular member customers.
– It also knows which member visits which bars, and which beer each member likes.
• Database schema
Select-from-where
• Which beers are made by Heineken?
SELECT name                (output attribute(s))
FROM Beers                 (table(s) to use)
WHERE manf = 'Heineken'
From Data Operations, this form of query can also be represented as a selection followed by a projection:
π name (σ manf = 'Heineken' (Beers))
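The Heineken query can be tried out with Python's built-in `sqlite3` module. The sketch assumes a two-column table `Beers(name, manf)`, which is what the query implies; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Beers (name TEXT, manf TEXT)")
conn.executemany(
    "INSERT INTO Beers VALUES (?, ?)",
    [  # made-up sample rows for illustration
        ("Lager One", "Heineken"),
        ("Pils Two", "Heineken"),
        ("Sample Ale", "Sample Brewery"),
    ],
)

# WHERE selects the rows; SELECT name keeps only the output attribute.
rows = conn.execute(
    "SELECT name FROM Beers WHERE manf = ?", ("Heineken",)
).fetchall()
names = sorted(name for (name,) in rows)
```

Note the straight single quotes around string literals in SQL; the `?` placeholder used here passes the value separately and avoids quoting issues altogether.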
More example queries
• Find expensive beers
Select-Project queries in large tables
SELECT *
FROM Beers
WHERE name LIKE 'Am%'
Thus, so long as the system knows the partitioning strategy, it can make its job much more efficient.
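A sketch of why knowing the partitioning strategy helps. It assumes, purely for illustration, that `Beers` is range-partitioned across three machines on the first letter of `name`; the machine names, letter ranges, and sample rows are hypothetical. With that knowledge, a query on a name prefix only needs to visit one partition.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (name, manf)

# Assumed strategy: range-partition on the first letter of `name`.
PARTITION_RANGES: Dict[str, Tuple[str, str]] = {
    "machine1": ("A", "H"),
    "machine2": ("I", "P"),
    "machine3": ("Q", "Z"),
}


def partition_for(name: str) -> str:
    """Return the machine responsible for names starting like `name`."""
    first = name[0].upper()
    for machine, (low, high) in PARTITION_RANGES.items():
        if low <= first <= high:
            return machine
    raise ValueError(f"no partition for {name!r}")


def load(rows: List[Row]) -> Dict[str, List[Row]]:
    """Distribute rows to machines according to the partitioning strategy."""
    partitions: Dict[str, List[Row]] = {m: [] for m in PARTITION_RANGES}
    for row in rows:
        partitions[partition_for(row[0])].append(row)
    return partitions


def select_name_prefix(partitions: Dict[str, List[Row]], prefix: str):
    """SELECT * FROM Beers WHERE name LIKE '<prefix>%', using one partition."""
    target = partition_for(prefix)  # the prefix fixes the first letter
    scanned = [target]
    matches = [r for r in partitions[target] if r[0].startswith(prefix)]
    return matches, scanned


rows = [  # made-up sample rows
    ("Amber One", "X Brewing"),
    ("American Two", "Y Brewing"),
    ("Best Three", "X Brewing"),
    ("Pale Four", "Z Brewing"),
    ("Stout Five", "Y Brewing"),
]
partitions = load(rows)
matches, scanned = select_name_prefix(partitions, "Am")
```

Only `machine1` is scanned for the prefix "Am"; the other two machines are never contacted.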
Evaluating Select-Project queries for large data
• Let’s look at the second query in the same partition setting
– The query condition is on the second attribute, manf, so the same trick cannot be applied.
– As it stands, this time we need to look in all partitions.
– It can be done in parallel.
SELECT name
FROM Beers
WHERE manf = 'Heineken'
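When the condition is on `manf` but the data is partitioned on `name`, every partition has to be scanned, though the scans are independent. The sketch below imitates this with a thread pool standing in for separate machines; the partition contents and function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (name, manf)

partitions: Dict[str, List[Row]] = {  # made-up rows, partitioned by name
    "machine1": [("Amber One", "Heineken"), ("Best Two", "Other Co")],
    "machine2": [("Lager Three", "Heineken")],
    "machine3": [("Stout Four", "Other Co")],
}


def scan_partition(rows: List[Row], manf: str) -> List[str]:
    """Local work on one machine: filter on manf, project name."""
    return [name for name, m in rows if m == manf]


def select_names_by_manf(manf: str) -> List[str]:
    """SELECT name FROM Beers WHERE manf = <manf>, scanning all partitions."""
    # Every partition must be scanned, but the scans run side by side.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(
            pool.map(lambda rows: scan_partition(rows, manf),
                     partitions.values())
        )
    # Merge the per-partition answers into one result.
    return [name for part in results for name in part]


heineken_beers = select_names_by_manf("Heineken")
```

The total work is the same as a sequential scan, but the elapsed time is bounded by the slowest partition rather than by the sum of all of them.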
• Yes: to do this, it would need one more piece in the solution, called an index structure.
– Very simply, an index can be thought of as a reverse table,
where given the value in a column, you would get back
the records where the value appears.
– Using an index speeds up query processing significantly.
– With indexes, we can solve this problem in different ways:
• Use local index on each machine
• Use a machine index for each value
• Use a combined index in a global index server
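A minimal sketch of the first two options, assuming in-memory dictionaries stand in for real index structures: a local index per machine maps each `manf` value to row positions, and a machine index maps each value to the machines that hold it. The partition contents and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str]  # (name, manf)

partitions: Dict[str, List[Row]] = {  # made-up rows for illustration
    "machine1": [("Amber One", "Heineken"), ("Best Two", "Other Co")],
    "machine2": [("Lager Three", "Heineken")],
    "machine3": [("Stout Four", "Other Co")],
}


def build_index(rows: List[Row], column: int) -> Dict[str, List[int]]:
    """Map each value in the column to the positions of rows containing it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return dict(index)


# Local index on each machine: manf value -> row positions on that machine.
local_indexes = {
    machine: build_index(rows, column=1)
    for machine, rows in partitions.items()
}

# Machine index: manf value -> machines holding at least one matching row.
machine_index: Dict[str, Set[str]] = defaultdict(set)
for machine, index in local_indexes.items():
    for value in index:
        machine_index[value].add(machine)


def names_by_manf(manf: str) -> Tuple[List[str], List[str]]:
    """Return matching names and the machines that had to be contacted."""
    machines = sorted(machine_index.get(manf, set()))
    names: List[str] = []
    for machine in machines:
        rows = partitions[machine]
        names.extend(rows[i][0] for i in local_indexes[machine][manf])
    return names, machines


names, contacted = names_by_manf("Heineken")
```

With both indexes in place, `machine3` is never contacted for this query, and on the machines that are contacted no full scan is needed.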
Conclusions
Summary
1. Summarised the key characteristics of a data stream and identified the requirements of
streaming data systems.
2. Described how Data Lakes enable batch processing of streaming data, and explained the
difference between ‘schema-on-write’ and ‘schema-on-read’.
3. Explained how data streams, data lakes, and data warehouses are organised on a spectrum
of big data management and storage.
4. Explained the advantages of using DBMS over a file system; specified the differences
between a parallel and a distributed file system; and described a MapReduce-style DBMS.
5. Explained the desirable characteristics of a Big Data Management System (BDMS), gave
examples of BDMSs, and described their similarities and differences.