1. We can see the necessity of configuring Kafka for your big data cluster. What is going to be the exact use case for Kafka, and is there any business necessity to configure Kafka on plain compute (EC2)? There are several solutions for streaming data to your EMR cluster, such as the ones below, so kindly brief us on the exact Kafka use case in as much detail as possible:
- Kafka on compute (EC2) can be used to stream data to your EMR cluster
- AWS MSK (Amazon Managed Streaming for Apache Kafka) can be used to stream data to your EMR cluster
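In either deployment, the consumer side on EMR looks the same; only the bootstrap-server endpoints change. A minimal sketch, assuming Spark Structured Streaming as the consumer (the broker endpoints and topic name below are placeholders, not values from your environment):

```python
# Sketch: consuming a Kafka topic from Spark on EMR. Works the same whether the
# brokers are self-managed on EC2 or an Amazon MSK cluster.

def kafka_reader_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map passed to spark.readStream.format('kafka')."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }

# On the EMR cluster this would plug in as:
#   spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()
#   df = (spark.readStream.format("kafka")
#         .options(**kafka_reader_options("broker1:9092,broker2:9092", "events"))
#         .load())
#   df.selectExpr("CAST(value AS STRING)").writeStream...
opts = kafka_reader_options("broker1:9092,broker2:9092", "events")
```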
2. Hive can be used for querying data with Structured Query Language, but we have seen an additional SQL component in the requirements section. By SQL, what would you need us to configure: Spark SQL for querying the data (or for some other use case), or are you going to proceed with Hive for querying with Structured Query Language?
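For context on the question above, a minimal sketch of what the SQL component could look like if Spark SQL is the intended engine. The table and column names are purely illustrative; with Hive support enabled, Spark SQL reads the same Hive metastore tables, so the identical statement would also run under Hive:

```python
# Illustrative statement only: the table "clickstream" and its columns are
# placeholders, not part of the stated requirements.
QUERY = """
SELECT event_type, COUNT(*) AS events
FROM   clickstream
GROUP  BY event_type
"""

# Under Spark SQL on EMR this would run as:
#   spark = (SparkSession.builder.appName("sql-check")
#            .enableHiveSupport().getOrCreate())
#   spark.sql(QUERY).show()
# Under Hive, the same statement runs via `hive -e "$QUERY"` or beeline.
```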
3. Apache Storm and Kafka can process very high volumes of real-time data in a distributed, fault-tolerant manner, but as an alternative to this we can go ahead with either Kafka on compute or AWS MSK and stream the data to the EMR cluster with the help of Spark Streaming in both Kafka deployment scenarios. In order for us to propose the correct solution for your use case, kindly brief us on the purpose of your Storm and Kafka deployment so that we can work on finding the relevant solution.
4. When it comes to the file system decision for the EMR workload, there are two possible deployment strategies, as stated below. In order for us to finalise the solution, kindly tell us the data processing frequency and whether only batch processing is going to happen (so the cluster will be made to run only for certain hours per day) or the cluster will be running for more than 18 hours per day or 24x7.
File system deployment strategies for EMR:
- Deployment with EMR using the default EBS-backed HDFS, or
- Deployment with S3 as the storage backend via an EMRFS-based connection
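To illustrate how the two strategies differ from a job's point of view, a small sketch (the bucket and path names are placeholders). Only the URI scheme changes in the code; the real decision is driven by durability versus I/O speed and cluster uptime:

```python
# Sketch of how the storage strategy surfaces in job code.

def output_path(use_emrfs: bool, dataset: str) -> str:
    """Return the target URI for a dataset under either storage strategy."""
    if use_emrfs:
        # S3 via EMRFS: data survives cluster termination, so it suits
        # transient, batch-only clusters that run a few hours per day.
        return f"s3://my-analytics-bucket/curated/{dataset}"
    # EBS-backed HDFS: faster local I/O, but the data lives only as long as
    # the cluster does, so it suits long-running (~24x7) clusters.
    return f"hdfs:///curated/{dataset}"

# e.g. df.write.parquet(output_path(use_emrfs=True, dataset="orders"))
```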
Reference links:
Kafka on top of compute to stream data to EMR - https://aws.amazon.com/blogs/big-data/real-time-stream-processing-using-apache-spark-streaming-and-apache-kafka-on-aws/
Kafka used as AWS MSK to stream data to EMR - https://github.com/aws-samples/analysing-realtime-streaming-data-using-msk-emr
Notes:
Hadoop (MapReduce) relies on local disk I/O.
Spark processes in memory (R-series instances recommended for memory-heavy workloads).
The key difference between Hadoop MapReduce and Spark lies in the approach to processing: Spark can do it in memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing speed differs significantly; Spark may be up to 100 times faster.
Creating and submitting jobs to an EMR cluster:
https://kulasangar.medium.com/create-an-emr-cluster-and-submit-a-job-using-boto3-c34134ef68a0
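Following the pattern in the article above, a hedged sketch of submitting a Spark step to an existing EMR cluster with boto3. The cluster ID, bucket, and script names are placeholders:

```python
# Sketch: building and submitting a spark-submit step via boto3.

def spark_step(name: str, script_s3_uri: str) -> dict:
    """Build a step definition for emr.add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

# With AWS credentials configured, this would be submitted as:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster ID
#       Steps=[spark_step("daily-etl", "s3://my-bucket/job.py")])
step = spark_step("daily-etl", "s3://my-bucket/job.py")
```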
EC2 with Hadoop architecture
Hint: sync for incoming and output (o/p) data.
See definitions:
- HDFS (HDFS, HBase, ZooKeeper)
- MapReduce (is it a data processing application/language package?)
- YARN
- Hive
- Spark
- Python
- R
- SQL (does it need to be on an EMR node, or in RDS, or on EC2 with a DB?)
- Pig
- Kafka
- Storm
Note:
S3 (an object store, not a file system) is about five times cheaper than HDFS.
S3 historically offered eventual consistency: if one node wrote data into the EMR cluster while another node read the same key at the same time, the read could report that the object did not exist. (Amazon S3 has provided strong read-after-write consistency since December 2020, which removes this concern for new designs.)
HDFS (~350 MB/s read and ~200 MB/s write throughput per node)
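A quick sanity check on the aggregate cluster throughput implied by those per-node figures; the 10-node cluster size below is illustrative, not from the requirements:

```python
# Rough aggregate-throughput estimate from the per-node HDFS figures above.
READ_MB_S_PER_NODE = 350
WRITE_MB_S_PER_NODE = 200

def cluster_throughput(nodes: int) -> tuple:
    """Return (aggregate read, aggregate write) in MB/s for a given node count."""
    return nodes * READ_MB_S_PER_NODE, nodes * WRITE_MB_S_PER_NODE

read_mb_s, write_mb_s = cluster_throughput(10)  # 3500 MB/s read, 2000 MB/s write
```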
Prerequisites for solution design:
Identify the storage mechanism that is going to be used for the EMR cluster.