Big Data ecosystem
M. Fanilo Andrianasolo
Data Analytics Tech Lead & Product Manager
2013 - 2016: Big Data Engineer
2016 - 2017: Data Analytics Evangelist
2017 - 2019: Data Science Tech Lead
2019 - current: Data Analytics Product Manager
Data is at the center of all IT activities
Data Hardware Innovations
Explosion of data
3Vs of Big Data
Volume
Velocity
Variety
Other Vs: Variability, Value, Vulnerability, Vwhatever..
Problem
Daily rate in 2014: 600 TB
Cost of 1 TB: ~$35, so ~$21,000 for 600 TB
Time to read 1 TB: ~3 hours, so ~75 days for 600 TB
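The figures on this slide follow from each other, and the arithmetic can be checked in a few lines. This is only a sketch: the constants are the slide's 2014 estimates, and the `machines` value is a hypothetical cluster size added to show why spreading reads over many disks matters.

```python
# Figures taken from the slide (2014 estimates).
DAILY_VOLUME_TB = 600        # data produced per day
COST_PER_TB_USD = 35         # approximate storage cost of 1 TB
HOURS_TO_READ_ONE_TB = 3     # approximate time to read 1 TB from one disk

# Storing one day of data, and reading it back from a single disk.
daily_storage_cost_usd = DAILY_VOLUME_TB * COST_PER_TB_USD
read_time_days = DAILY_VOLUME_TB * HOURS_TO_READ_ONE_TB / 24

# Hypothetical cluster size (not from the slide): reads spread evenly over machines.
machines = 100
parallel_read_hours = DAILY_VOLUME_TB * HOURS_TO_READ_ONE_TB / machines

print(daily_storage_cost_usd)  # 21000
print(read_time_days)          # 75.0
print(parallel_read_hours)     # 18.0
```

Storage cost is not the bottleneck here; the read time on a single machine is, which motivates the scaling discussion that follows.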
How should we store and query such data?
Scaling?
Vertical scaling
  Advantages: less power consumption and cooling costs; less challenging to implement; lower licensing costs; (sometimes) less network hardware
  Drawbacks: price; hardware failure causes bigger outages; vendor lock-in; limited upgradeability
Horizontal scaling
  Advantages: much cheaper; easier fault-tolerance; “easier” upgrade by adding new machines
  Drawbacks: bigger energy footprint; higher utility cost (electricity, cooling); more networking equipment
Scaling is hard
Big Data ecosystem
Apache Hadoop
Open-source software for reliable, scalable, distributed computing
Apache Hadoop ecosystem
More than 30 open source projects for managing and analyzing Big Data
…
Hadoop distributions
Hadoop distributions vs Cloud providers
Hadoop ecosystem use cases
Web indexing from web crawlers
Playlist generation from listening history
Log analysis
Product recommendation from purchases
A data platform canvas
Acquisition Transport Storage Processing Servicing
Security
Orchestration
Acquisition
Import: from RDBMS and FS into Hadoop
Export: from Hadoop to RDBMS and FS
Transport
Avro schema (.avsc):
{
  "type" : "record",
  "namespace" : "test",
  "name" : "Employee",
  "fields" : [
    { "name" : "Name" , "type" : "string" },
    { "name" : "Age" , "type" : "int" }
  ]
}
Generated class used from Java (.java):
Employee e1 = new Employee();
e1.setName("omar");
e1.setAge(21);
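To make the role of the schema concrete, here is a minimal sketch of how a record matching the Employee schema could be serialized with Avro's binary encoding rules for `int` (zig-zag, variable-length) and `string` (length prefix followed by UTF-8 bytes). The helper names `encode_long`, `encode_string` and `encode_record` are hypothetical and only cover these two types; a real project would use an Avro library instead.

```python
import json

# Same schema as on the slide.
SCHEMA = json.loads("""
{
  "type": "record",
  "namespace": "test",
  "name": "Employee",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"}
  ]
}
""")


def encode_long(value: int) -> bytes:
    """Zig-zag encode, then emit 7 bits per byte, low-order group first."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # high bit set: more bytes follow
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(value: str) -> bytes:
    """Length of the UTF-8 bytes (as a long), followed by the bytes."""
    data = value.encode("utf-8")
    return encode_long(len(data)) + data


# Only the two primitive types used by the schema are supported in this sketch.
ENCODERS = {"int": encode_long, "string": encode_string}


def encode_record(schema: dict, record: dict) -> bytes:
    """A record is the concatenation of its fields, in schema order."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


payload = encode_record(SCHEMA, {"Name": "omar", "Age": 21})
print(payload)  # b'\x08omar*'
```

The payload carries no field names or types, only values, which is why reader and writer both need the schema.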
Storage
Hadoop Distributed File System
NameNode
DataNode1   DataNode2   DataNode3
block 1     block 2     block 1
block 2     block 1     block 1
block 1     block 1     block 2
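The diagram shows file blocks replicated across DataNodes. A minimal sketch of that idea follows, assuming a simple round-robin placement. Real HDFS placement is rack-aware and defaults to a replication factor of 3 with 128 MB blocks; the function names and the replication factor of 2 used here are illustrative assumptions.

```python
import math


def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> int:
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(num_blocks: int, datanodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: dict, failed: set) -> set:
    """Blocks that still have at least one replica on a live DataNode."""
    return {
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    }


datanodes = ["DataNode1", "DataNode2", "DataNode3"]
num_blocks = split_into_blocks(300)            # a 300 MB file needs 3 blocks
placement = place_blocks(num_blocks, datanodes, replication=2)

print(placement)
print(readable_blocks(placement, failed={"DataNode2"}))               # {0, 1, 2}
print(readable_blocks(placement, failed={"DataNode1", "DataNode2"}))  # {1, 2}
```

With two replicas, losing one DataNode loses no data, while losing two can. The mapping from block to DataNodes is the kind of metadata the NameNode keeps.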
HBase
Key U:cookie U:is_auth U:has_t P:Product1 P:Product2 P:Product3
1960:Fanilo c13e 1 3
2001:Fanilo c13e 1
1990:Omar d45 1
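The table above can be read as a sparse, sorted map: row key, then `family:qualifier`, then value. The sketch below models that with plain dictionaries. Only the `U:cookie` cells are reproduced, because the positions of the other cells were lost in the flattened table; `prefix_scan` and `get` are hypothetical helpers, not the HBase client API.

```python
from bisect import bisect_left

# row key -> {"family:qualifier": value}; cells that do not exist are simply absent.
rows = {
    "2001:Fanilo": {"U:cookie": "c13e"},
    "1960:Fanilo": {"U:cookie": "c13e"},
    "1990:Omar": {"U:cookie": "d45"},
}

# HBase keeps rows sorted by key, which is what makes range scans cheap.
sorted_keys = sorted(rows)


def prefix_scan(prefix: str) -> list:
    """Return all row keys starting with `prefix`, in key order."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def get(row_key: str, column: str):
    """Point lookup of one cell; a missing cell yields None."""
    return rows.get(row_key, {}).get(column)


print(prefix_scan("19"))                # ['1960:Fanilo', '1990:Omar']
print(get("1990:Omar", "U:cookie"))     # d45
print(get("1990:Omar", "P:Product1"))   # None
```

Because lookups and scans go through the row key, the key design (here a value followed by a name) decides which queries are efficient.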
Processing
YARN – Yet Another Resource Negotiator
YARN hides the resource management details from the user to
facilitate the management of parallel applications.
Batch processing - Map Reduce
Data locality: moving computation is cheaper than moving data
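The MapReduce model can be sketched in memory as three steps: map each input split to key/value pairs, shuffle the pairs by key, then reduce each group. The word-count job and the function names below are illustrative assumptions; on a real cluster each split would be processed on the node that stores it, which is where data locality comes in.

```python
from collections import defaultdict


def map_phase(line: str):
    """Emit (word, 1) for every word of one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list):
    """Combine all values of one key into a single result."""
    return key, sum(values)


# Each string stands for an input split stored on a different node.
splits = ["big data big cluster", "data locality"]

mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'locality': 1}
```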
Batch processing
music_sales.csv
1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
HiveQL, compiled to MapReduce
Metastore
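As an illustration of the kind of aggregate a SQL-like query would compute over this file, here is a sketch that parses the rows and sums the revenue. The column meaning (id, title, price, quantity) is an assumption, since the slide does not name the columns, and `parse` is a hypothetical helper standing in for the table definition that the Metastore would hold.

```python
from decimal import Decimal

MUSIC_SALES_CSV = """\
1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
"""


def parse(line: str) -> dict:
    """Turn one CSV line into a typed row (assumed columns, see lead-in)."""
    sale_id, title, price, quantity = [field.strip() for field in line.split(",")]
    return {
        "id": int(sale_id),
        "title": title.strip("«» "),
        "price": Decimal(price.rstrip("€")),
        "quantity": int(quantity),
    }


rows = [parse(line) for line in MUSIC_SALES_CSV.splitlines()]

# Roughly what a query such as SUM(price * quantity) would return.
total_revenue = sum(row["price"] * row["quantity"] for row in rows)

print(rows[0]["title"])  # Let it go
print(total_revenue)     # 46.88
```

`Decimal` is used instead of `float` so that the monetary total is exact.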
Batch processing
music_sales.csv
1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
Pig script, compiled to MapReduce:
recordings = LOAD '$file' USING PigStorage(',') AS
    (id, price, artist, title, duration, year);
limited = LIMIT recordings $size;
DUMP limited;
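The three steps of the script (load a delimited file, keep the first rows, print them) can be mirrored with Python generators, which also process records lazily. This is a sketch of the data flow only, not of how Pig executes it; the function names `load`, `limit` and `dump` are chosen to echo the Pig operators and are otherwise hypothetical.

```python
import csv
import io
from itertools import islice

MUSIC_SALES_CSV = """\
1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
"""


def load(stream, delimiter=","):
    """Like LOAD ... USING PigStorage(','): yield one tuple per line."""
    for record in csv.reader(stream, delimiter=delimiter):
        yield tuple(field.strip() for field in record)


def limit(relation, size):
    """Like LIMIT: keep at most `size` tuples, without reading the rest."""
    return islice(relation, size)


def dump(relation):
    """Like DUMP: materialize the relation and print every tuple."""
    tuples = list(relation)
    for item in tuples:
        print(item)
    return tuples


recordings = load(io.StringIO(MUSIC_SALES_CSV))
first_two = dump(limit(recordings, 2))
```

Nothing is read until `dump` runs, which loosely matches how Pig builds a plan and only executes it when an output statement is reached.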
Realtime processing
Visualizing
Security
Kerberos
Orchestration
Overview
Acquisition Transport Storage Processing Servicing
Security
Orchestration
Architecture design
Multiple architectures
Lambda architecture
Kappa architecture
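The difference between the two architectures can be sketched on a toy play-count problem. In a Lambda architecture a batch layer periodically recomputes a view over the full dataset, a speed layer covers the events that arrived since, and queries merge both. In a Kappa architecture a single stream job processes a replayable log. The event log, the cutoff and the function name below are hypothetical.

```python
from collections import Counter

# Hypothetical event log: one song id per listen, in arrival order.
event_log = ["a", "b", "a", "c", "a", "b"]

# Events before this index were covered by the last batch run.
BATCH_CUTOFF = 4


def count_plays(events) -> Counter:
    """The business logic: number of listens per song."""
    return Counter(events)


# Lambda: two code paths, merged at query time.
batch_view = count_plays(event_log[:BATCH_CUTOFF])   # slow, complete, recomputed periodically
speed_view = count_plays(event_log[BATCH_CUTOFF:])   # fast, covers only recent events
lambda_result = batch_view + speed_view

# Kappa: one streaming code path; reprocessing means replaying the log from the start.
kappa_result = Counter()
for song in event_log:
    kappa_result[song] += 1

print(lambda_result == kappa_result)  # True
print(dict(kappa_result))             # {'a': 3, 'b': 2, 'c': 1}
```

Both give the same answer here. The trade-off is operational: Lambda maintains two implementations of the logic, while Kappa relies on the log being retained long enough to replay.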
CONCLUSION
THANKS
@andfanilo
andfanilo@[Link]