1. We can see the necessity of configuring Kafka for your big data cluster. What is going to be the exact use case for Kafka, and is there any business necessity to configure Kafka on plain compute (EC2)? There are several solutions for streaming data to your EMR cluster, such as the ones below, so kindly brief us on the exact Kafka use case in as much detail as possible:
- Kafka on compute (EC2) can be used to stream data to your EMR cluster
- AWS MSK (Amazon Managed Streaming for Apache Kafka) can be used to stream data to your EMR cluster
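In either deployment, the consumer side on EMR looks the same; only the bootstrap-server endpoints change. A minimal sketch, assuming Spark Structured Streaming as the consumer (the broker endpoints and topic name below are placeholders, not values from your environment):

```python
# Sketch: consuming a Kafka topic from Spark on EMR. Works the same whether the
# brokers are self-managed on EC2 or an Amazon MSK cluster.

def kafka_reader_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map passed to spark.readStream.format('kafka')."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }

# On the EMR cluster this would plug in as:
#   spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()
#   df = (spark.readStream.format("kafka")
#         .options(**kafka_reader_options("broker1:9092,broker2:9092", "events"))
#         .load())
#   df.selectExpr("CAST(value AS STRING)").writeStream...
opts = kafka_reader_options("broker1:9092,broker2:9092", "events")
```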
2. Hive can be used for querying data with Structured Query Language, but we have seen an additional SQL component in the requirements section. By SQL, what would you need us to configure: Spark SQL for querying the data (or for some other use case), or are you going to proceed with Hive for querying with Structured Query Language?
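For context on the question above, a minimal sketch of what the SQL component could look like if Spark SQL is the intended engine. The table and column names are purely illustrative; with Hive support enabled, Spark SQL reads the same Hive metastore tables, so the identical statement would also run under Hive:

```python
# Illustrative statement only: the table "clickstream" and its columns are
# placeholders, not part of the stated requirements.
QUERY = """
SELECT event_type, COUNT(*) AS events
FROM   clickstream
GROUP  BY event_type
"""

# Under Spark SQL on EMR this would run as:
#   spark = (SparkSession.builder.appName("sql-check")
#            .enableHiveSupport().getOrCreate())
#   spark.sql(QUERY).show()
# Under Hive, the same statement runs via `hive -e "$QUERY"` or beeline.
```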
3. Apache Storm and Kafka can process very high volumes of real-time data in a distributed, fault-tolerant manner, but as an alternative to this we can go ahead with either Kafka on compute or AWS MSK and stream the data to the EMR cluster with the help of Spark Streaming in both Kafka deployment scenarios. In order for us to propose the correct solution for your use case, kindly brief us on the purpose of your Storm and Kafka deployment so that we can work on finding the relevant solution.
4. When it comes to the file system decision for the EMR workload, there are two possible deployment strategies, as stated below. In order for us to finalise the solution, kindly tell us the data processing frequency and whether only batch processing is going to happen (so the cluster will be made to run only for certain hours per day) or the cluster will be running for more than 18 hours per day or 24x7.
File system deployment strategies for EMR:
- Deployment with EMR using the default EBS-backed HDFS, or
- Deployment with S3 as the storage backend via an EMRFS-based connection
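To illustrate how the two strategies differ from a job's point of view, a small sketch (the bucket and path names are placeholders). Only the URI scheme changes in the code; the real decision is driven by durability versus I/O speed and cluster uptime:

```python
# Sketch of how the storage strategy surfaces in job code.

def output_path(use_emrfs: bool, dataset: str) -> str:
    """Return the target URI for a dataset under either storage strategy."""
    if use_emrfs:
        # S3 via EMRFS: data survives cluster termination, so it suits
        # transient, batch-only clusters that run a few hours per day.
        return f"s3://my-analytics-bucket/curated/{dataset}"
    # EBS-backed HDFS: faster local I/O, but the data lives only as long as
    # the cluster does, so it suits long-running (~24x7) clusters.
    return f"hdfs:///curated/{dataset}"

# e.g. df.write.parquet(output_path(use_emrfs=True, dataset="orders"))
```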
Reference links:
Kafka on top of compute to stream data to EMR - https://aws.amazon.com/blogs/big-data/real-time-stream-processing-using-apache-spark-streaming-and-apache-kafka-on-aws/
Kafka used as AWS MSK to stream data to EMR - https://github.com/aws-samples/analysing-realtime-streaming-data-using-msk-emr
Notes:
Hadoop (MapReduce) relies on local disk I/O.
Spark processes in memory (R-series instances recommended for memory-heavy workloads).
The key difference between Hadoop MapReduce and Spark lies in the approach to processing: Spark can do it in memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing speed differs significantly; Spark may be up to 100 times faster.
Creating and submitting jobs to an EMR cluster:
https://kulasangar.medium.com/create-an-emr-cluster-and-submit-a-job-using-boto3-c34134ef68a0
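Following the pattern in the article above, a hedged sketch of submitting a Spark step to an existing EMR cluster with boto3. The cluster ID, bucket, and script names are placeholders:

```python
# Sketch: building and submitting a spark-submit step via boto3.

def spark_step(name: str, script_s3_uri: str) -> dict:
    """Build a step definition for emr.add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

# With AWS credentials configured, this would be submitted as:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster ID
#       Steps=[spark_step("daily-etl", "s3://my-bucket/job.py")])
step = spark_step("daily-etl", "s3://my-bucket/job.py")
```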
EC2 with Hadoop architecture
Hint: sync for incoming and output (o/p) data.
See definitions:
- HDFS (HDFS, HBase, ZooKeeper)
- MapReduce (is it a data processing application/language package?)
- YARN
- Hive
- Spark
- Python
- R
- SQL (does it need to be on an EMR node, or in RDS, or on EC2 with a DB?)
- Pig
- Kafka
- Storm
Note:
S3 (an object store, not a file system) is about five times cheaper than HDFS.
S3 historically offered eventual consistency: if one node wrote data into the EMR cluster while another node read the same key at the same time, the read could report that the object did not exist. (Amazon S3 has provided strong read-after-write consistency since December 2020, which removes this concern for new designs.)
HDFS (~350 MB/s read and ~200 MB/s write throughput per node)
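A quick sanity check on the aggregate cluster throughput implied by those per-node figures; the 10-node cluster size below is illustrative, not from the requirements:

```python
# Rough aggregate-throughput estimate from the per-node HDFS figures above.
READ_MB_S_PER_NODE = 350
WRITE_MB_S_PER_NODE = 200

def cluster_throughput(nodes: int) -> tuple:
    """Return (aggregate read, aggregate write) in MB/s for a given node count."""
    return nodes * READ_MB_S_PER_NODE, nodes * WRITE_MB_S_PER_NODE

read_mb_s, write_mb_s = cluster_throughput(10)  # 3500 MB/s read, 2000 MB/s write
```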
Prerequisites for solution design:
Identify the storage mechanism that is going to be used for the EMR cluster.