Big Data Assessment-1
Big Data Analysis Using Apache Hadoop
1. Introduction
The amount of data created every day is rising dramatically. Very large data sets,
often measured in zettabytes, are typically referred to as "big data". Government
authorities, businesses, and other organisations work to gather and preserve data
about their citizens and clients in an effort to better understand them and forecast
future behaviour. Social networking sites such as Facebook and Twitter produce new
information every second, and managing this kind of data is one of the biggest
challenges that businesses face. Data stored in data warehouses is in a raw state and
requires proper processing and analysis before it can deliver valuable information,
which creates further difficulties. To process vast volumes of data within brief
timeframes, various new tools are being implemented. The Java-based Apache Hadoop
programming platform can be used in networked computing systems to process large
data sets.
Hadoop is applied in systems with numerous nodes that are capable of managing
terabytes of data. Hadoop uses its own distributed file system, HDFS, to provide fast
data transfer that tolerates node failures and avoids overall system failure. Hadoop
uses the MapReduce model, which divides massive amounts of information into smaller
units and performs various processing steps on them in parallel.
In practice, several technologies work together to perform this task: REST web
services for communication, the Spring for Apache Hadoop framework for basic
functionality and MapReduce job execution, Apache Maven for building the distributed
code, and Apache Hadoop itself for distributed dataset processing.
2. Apache Hadoop
Apache Hadoop is a Java-based software platform that facilitates massive data
processing on low-cost hardware clusters with distributed storage. Its computing
power and data storage capacity can grow from a single server to hundreds of
servers. As an open-source project, Hadoop is typically deployed for large-scale data
applications such as data mining, machine learning, and real-time data processing.
Its capacity to handle huge amounts of data reliably and efficiently makes it an ideal
choice for businesses working with large datasets (Rahman Rhythm et al., 2022b).
The open-source Hadoop framework is published under the Apache License and has been
applied to data processing, storage, and management in various big data applications
that run on clustered computers. Big data was originally characterised by the "3Vs",
but the model has since been extended to the "5Vs", which are described below as the
defining characteristics of big data.
(Source: https://azuretar.com/big-data-101-with-apache-spark-and-python/)
Volume:
Big data refers to very large amounts of data. Previously, data was generated either
by users or by professionals; today, the scale of data that needs to be analysed is
enormous because it is produced and processed for many kinds of functions by
machines, networks, and human interaction on channels like social media. Where most
of an organisation's data once came from its own employees, data is now gathered from
clients, partners, and employees as well as from Twitter, Facebook, Instagram,
YouTube and other websites. Consider the immense amount of data created every day: a
single file might range from a few kilobytes to more than a gigabyte in size, and
thousands of such files taken together reflect the volume dimension of big data.
Variety:
Variety refers to the many sources and types of data.
Organised Information: Data that has been arranged into a prepared
repository, usually a database, is known as organised information or
structured data. Structured data implies a database in which the data is
organised so that it is easily accessible. An RDBMS, where data is arranged
in tables with rows and columns, is a prime illustration.
Unorganised Data: Textual and non-textual data with no predefined structure
is considered unorganised or unstructured data. Examples of this type of data
include photographs, Portable Document Format (PDF) files, audio and video
files, and many other kinds of data sourced from social media platforms.
Websites such as Facebook, Twitter, YouTube and WhatsApp allow users to
submit data in the form of likes, posts, comments, and uploaded pictures. All
of this data is classified as unstructured because it cannot be stored in the
form of rows and columns.
Partially Organised Data: Partially organised, or semi-structured, data does
not fit a rigid schema but carries metadata that describes its structure.
Although not necessarily ordered, this metadata can still be stored in
databases. HTML is a good example: it consists of tags, each of which has a
parent tag and a child tag, and we label this type of data as semi-structured.
Velocity:
In big data, velocity refers to the speed at which information flows into the system
from various sources, including corporate procedures, machinery, networks, and
human engagement with items such as social media platforms and mobile devices. The
flow of data is massive and continuous, and this real-time data helps researchers and
corporations make important decisions.
Velocity also covers the analysis of data in streaming form, that is, how swiftly
data is processed. The New York Stock Exchange, the largest stock market in the
world, generates, captures, and processes around one terabyte of data every trading
session; one can only imagine how rapidly such data needs to be processed. This is
what the velocity of streaming data analysis refers to.
Veracity:
In big data, the term "veracity" highlights the noise, biases, and abnormalities in
the data being mined and stored, and whether that data is meaningful for the issue
being studied. Compared with factors such as volume and velocity, veracity presents
the greatest hurdle when analysing data.
Value:
Value is the most crucial of the big data characteristics. Having access to big data
is great, but it is unusable unless we can turn it into something valuable.
Implementing IT infrastructure systems to hold large volumes of data is increasingly
expensive, and businesses need to see a return on that investment.
2.1 Components of Hadoop
Hadoop consists of three primary components.
1. HDFS: Hadoop Distributed File System (HDFS) divides large datasets into
smaller blocks, distributing them across nodes in a cluster for enhanced
scalability (Bhosale, Devendra and Gadekar, 2014; Miguel, Caballé and
Xhafa, 2015). Specifically designed for deployment on commodity clusters,
HDFS ensures reliability through the replication of file data, achieving fault
tolerance by duplicating data across multiple nodes. This replication strategy
guarantees data availability, even in the event of node failures (Gunarathne et
al., 2010). The MapReduce programming model facilitates parallel
processing by breaking down complex tasks into smaller, independent sub-
tasks, processed in parallel across the nodes (Bhosale, Devendra and
Gadekar, 2014; Miguel, Caballé and Xhafa, 2015).
2. Map Reduce: MapReduce is the heart of Hadoop, and the concept behind it is
simple to understand. In practical terms, the name "MapReduce" indicates two
separate and distinct tasks that a Hadoop program carries out. The map job is
the initial process; it receives the input data and uses it to generate
key/value pairs. These pairs are then sorted and shuffled before being passed
as input to the reduce task. The reduce task gathers those key/value pairs and
aggregates or combines them to produce the result. As the name MapReduce
suggests, the reduce job is always completed after the map job; a minimal
word-count sketch of this model is shown after this list.
3. Yarn: Yet Another Resource Negotiator (YARN) is responsible for resource
management in the Hadoop system and is one of the key components of the
Hadoop ecosystem. Given its importance in workload administration and
tracking, YARN is often referred to as the operating system of Hadoop. It
enables multiple data processing engines to perform batch processing and
real-time streaming on data stored in a single platform.
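The classic word count illustrates how the map and reduce phases fit together. The
following is a minimal, illustrative Java sketch of such a MapReduce job; the class
names and input/output paths are chosen only for this example and are not drawn from
any particular source.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a (word, 1) key/value pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle and sort, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged as a JAR and submitted with a command along
the lines of "hadoop jar wordcount.jar WordCount /input /output", where both paths
refer to HDFS directories.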
There are also other components like Hive, Pig, HBase and many more.
Other Components:
Hive: The open-source data warehousing programme Hive was developed on
top of Hadoop. It was created to enhance Hadoop's query capabilities and
facilitate more effective analysis of big datasets for end users. With Hive,
queries can be expressed in HiveQL, a declarative language that is
comparable to SQL and is compiled into Hadoop map-reduce tasks. It makes
working with data easier for users by introducing well-known ideas like tables,
columns, and partitions to the unstructured world of Hadoop. Metastore is a
system catalogue that comes with Hive. It has statistics and schemas that are
helpful for data exploration, query optimisation, and query compilation.
Additionally, it extends the functionality of the underlying IO libraries to query
data in new formats and enables users to plug bespoke map-reduce scripts into
queries.
Pig: For managing massive volumes of data, Apache Pig is an open-source
project built on top of the Hadoop MapReduce framework. It consists of a
runtime environment that runs on top of Hadoop and Pig Latin, a dataflow
language for specifying data processing operations. Pig has capabilities like
nested data models and rich user-defined functions to accommodate all types
of data: structured, unstructured, and semi-structured. It's commonly used in
businesses like Yahoo, Twitter, AOL, MapQuest, and LinkedIn for a variety of
purposes, such as data mining, log processing, and data querying. Pig has
the benefit of allowing developers to write fewer lines of code, which cuts
down on development and testing time.
HBase: HBase is a distributed, fault-tolerant, and scalable database system
built on top of the Hadoop Distributed File System (HDFS). It provides
real-time read-write access and is akin to Google's BigTable. Data is stored
in tables as rows and columns using a multidimensional sorted map format, and
each cell is uniquely identified by table, row, column family, column, and
timestamp. Along with its Java client API, HBase supports MapReduce jobs.
Through integrations with Hive and HBql it allows SQL-like querying and
additional indices. HBase can scale to petabyte-sized datasets and integrates
easily with a variety of data sources; a brief sketch of the Java client API
is shown below.
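As a rough illustration of the Java client API mentioned above, the sketch below
writes and then reads back a single cell. The table name, column family, and row key
are hypothetical and used only for this example; the snippet assumes an HBase client
configured via hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) { // hypothetical table

            // Write one cell: row key, column family, column qualifier, value.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back in real time by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}

Because the configuration is read from the classpath, the same code runs unchanged
against any cluster the client is configured for.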
(Source: https://data-flair.training/blogs/hadoop-ecosystem-components/)
2.2 Critical Analysis of Strengths and Weaknesses of Hadoop:
2.2.1 Advantages:
1. Flexibility and Scalability: Hadoop scales horizontally by adding more nodes to
the cluster to handle increasing amounts of data. As the data volume grows, more
nodes can be added seamlessly to handle the load, ensuring efficient processing
(Olson, 2010; Kitchin, 2014; Khalandar et al., 2019).
2. Fault Tolerance: Data availability is ensured even in the event of node
failures thanks to HDFS's replication of data across nodes. HDFS also has
mechanisms for detecting and recovering from failures: it uses a combination
of heartbeats and block reports to monitor the health of the cluster and identify
failed nodes. When a node fails, HDFS automatically re-replicates the data
blocks stored on that node to maintain the desired replication factor (Kitchin,
2014; Khalandar et al., 2019; Elkawkagy and Elbeh, 2020). A small illustration of
inspecting replication through the Java FileSystem API follows this list.
3. Cost-Effectiveness: Hadoop operates on commodity hardware, which
consists of affordable, readily available components, thereby slashing
infrastructure expenses (Barrett and Kipper, 2010). Through Hadoop,
organizations can streamline resource allocation, cutting down on manpower
and time requirements and achieving significant cost reductions. The distributed
storage and processing functionalities of Hadoop enable optimal utilization of
hardware resources, mitigating the need for costly hardware upgrades
(Khalandar et al., 2019).
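As a brief illustration of the fault-tolerance point above, the replication factor
and block placement of a file in HDFS can be inspected through the Java FileSystem
API. This is a minimal sketch assuming a configured Hadoop client on the classpath;
the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // hypothetical HDFS path

        // Ask the NameNode to keep three copies of every block of this file.
        fs.setReplication(file, (short) 3);

        // Report where each block is physically stored; if a DataNode fails,
        // HDFS re-replicates its blocks elsewhere to restore this factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor = " + status.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(", ", loc.getHosts()));
        }
        fs.close();
    }
}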
2.2.2 Weaknesses:
1. Latency: Apache Hadoop's latency can be considered a weakness,
particularly in scenarios where low latency is critical, such as in finance and
other real-time applications. The MapReduce framework, which is the core of
Hadoop, is not designed for low-latency processing, making it less suitable for
interactive query processing and real-time data analysis (Yazidi et al., 2021).
2. Complexity: Hadoop has a steep learning curve, and managing clusters can
be complex for some organizations (Gurusamy, Kannan and Nandhini, 2017).
The complexities of Hadoop also include the need for efficient resource
utilisation, input splits, and shuffle operations (Ahmed et al., 2020).
3. Security Vulnerabilities: Research points out security issues within the
MapReduce framework of Apache Hadoop, such as lack of authentication and
unsecured communication between Hadoop daemons (Bhathal and Singh,
2019).
2.3 Hadoop vs Spark:
Feature-by-feature comparison of Hadoop and Spark:
Mode of Processing: Hadoop facilitates batch processing only, whereas Spark supports
both batch and stream processing.
Velocity: Hadoop computes using the hard drive, which makes it slower, whereas Spark
executes calculations in memory, which makes it faster.
Data Caching: Caching of data in memory is not supported by Hadoop but is supported
by Spark.
Protection: Hadoop is seen as being secure, whereas Spark is seen as being less
secure compared to Hadoop.
Cost: Hadoop is less expensive than Spark, since Spark needs more memory.
Languages of Programming: The majority of Hadoop runs on Java, whereas Spark supports
numerous programming languages, including Scala, Java, Python, and R.
2.4 Practical Applications:
2.4.1 Real-World Cases/Examples in Various Sectors:
1. Transportation: Hadoop plays a crucial role in the transportation sector, par-
ticularly in the analysis and optimization of traffic. By processing extensive
volumes of traffic data, it becomes possible to enhance route planning, allevi-
ate congestion, and boost overall transportation efficiency. Within the trans-
portation industry, big data is harnessed for various purposes such as route
optimization, effective fleet management, predictive maintenance, and the en-
hancement of logistics operations, as highlighted by Raj in 2018.
2. Petroleum: The oil and gas sector uses Hadoop for big data analytics in a
range of technological applications. It is employed to manage the massive
volume, diversity, speed, and complexity of data produced by processes such as
drilling, exploration, and production. In particular, Hadoop is used in oil and
gas exploration to process and analyse seismic data (which, according to Law
Insider, is collected by sending energy or sound waves into the earth and
recording the wave reflections to indicate the type, size, shape and depth of
subsurface rock formations) in order to create 2D and 3D representations of
subsurface layers. Furthermore, real-time data transmitted from drilling
instruments such as logging-while-drilling (LWD) and measurement-while-drilling
(MWD) tools is processed and analysed using Hadoop. Hadoop is also used in the
downstream oil and gas sector for activities such as improving petrochemical
asset management, optimising production pump performance, and improving
reservoir characterisation and modelling.
3. HealthCare: Hadoop has proved a powerful technology for large-scale medical
image analysis, where three important processing scenarios have been
established. First, it is used on a Hadoop cluster for support vector machine
(SVM) parameter optimisation in lung texture classification. Using SVM, a
well-known machine learning methodology, the best parameters for classifying
lung (or other) textures are found with this method; with the MapReduce
framework accelerating the search, finding the ideal SVM parameters takes a
total of about 10 hours.
Second, Hadoop is used within the MapReduce architecture for content-based
medical image indexing. With this technique, content-based medical image
indexing can be completed more quickly, showing that MapReduce is an effective
way of indexing large numbers of images.
Finally, solid texture analysis based on three-dimensional wavelets (wave-like
oscillations whose amplitude begins at zero) is parallelised using Hadoop,
which greatly shortens the total runtime while preserving Hadoop streaming
compatibility. This approach shortens the overall execution time by
accelerating wavelet analysis for solid texture categorisation.
2.5 Challenges:
1. Security and Governance: Due to its distributed architecture and its
dependencies on a wide range of technologies, involving memory resources,
processors, operating systems, networks, databases, and communication
protocols, Hadoop presents a substantial security problem (Rajeh, 2022). Any of
these components could have a security issue that puts the system as a whole at
risk. Hadoop's capacity for parallel computation results in a complicated
ecosystem that is highly vulnerable to attack. Because users share physical
resources with one another, data is never entirely under a single user's
authority, and parallelism means that data is stored across numerous machines,
so it would be simple for a malicious actor to share a physical device with a
client. If sufficient security measures are not taken, an adversary may gain
unrestricted access to information and sabotage legitimate users. Compromised
clients can spread harmful data across the network, potentially affecting the
entire system.
2. Performance: To transport data across nodes, Hadoop relies on Remote
Procedure Calls over TCP/IP. Because the default communication is insecure, it
is easy for attackers to alter inter-node communication and compromise the
system. Since computations can be carried out anywhere in the cluster,
precisely locating a single computation is difficult, which makes it hard to
guarantee the security of every computing location. Moreover, insecure
communication can lead to data leakage while data is transferred between a
DataNode and a client. Unwanted nodes are frequently added to systems with the
intention of stealing data or impairing computations.
3. Complexity: Because Hadoop is a complicated system, managing and
maintaining it requires a high level of skill. One of the major issues with
Hadoop reported in the literature is the complexity of its parallelisation and
distribution; new designs, approaches, algorithms, and analytics are required
to manage it and to extract value and hidden knowledge from it.
2.6 Impact of Apache Hadoop on Efficiency and Decision-Making
Apache Hadoop has a transformative impact on efficiency and decision-making
across various industries. Its ability to process large datasets allows organizations to
derive valuable insights for informed decision-making (Buyya, Vecchiola and Selvi,
2013). For instance, in e-commerce, Hadoop is used to analyse customer behaviour
and preferences, enabling businesses to optimize product recommendations and
marketing strategies, ultimately improving overall operational efficiency and decision-
making processes.
Apache Hadoop's technical prowess in handling big data, its parallel and distributed
processing capabilities, and its practical applications across diverse sectors
underscore its significance in the field of big data analytics. While facing challenges,
Hadoop continues to be a driving force in empowering organizations to extract
meaningful insights from large datasets, thereby enhancing decision-making
processes and operational efficiencies.
3 Conclusion:
In summary, Apache Hadoop is a remarkable framework that transforms data
processing and decision-making procedures across a variety of industries. By
empowering organisations to derive important insights from massive datasets, Hadoop
increases operational efficiency and facilitates well-informed decision-making.
Even in the face of obstacles such as security and governance concerns, Hadoop
remains a major player in the big data analytics space. Its significant influence on
efficiency and decision-making makes it a crucial component of the era of
data-driven insights and operational excellence.
References:
1. Ahmed, N., Barczak, A.L., Susnjak, T. and Rashid, M.A., 2020. A comprehen-
sive performance analysis of Apache Hadoop and Apache Spark for large
scale data sets using HiBench. Journal of Big Data, 7(1), p.110.
2. Anuradha, J., 2015. A brief introduction on Big Data 5Vs characteristics and
Hadoop technology. Procedia computer science, 48, pp.319-324.
3. Asri, H., Mousannif, H., Al Moatassime, H. and Noel, T., 2015, June. Big data
in healthcare: Challenges and opportunities. In 2015 International Conference
on Cloud Technologies and Applications (CloudTech) (pp. 1-7). IEEE.
4. Benlachmi, Y., El Yazidi, A. and Hasnaoui, M.L., 2021. A comparative analy-
sis of hadoop and spark frameworks using word count algorithm. International
Journal of Advanced Computer Science and Applications, 12(4), pp.778-788.
5. Bhathal, G.S. and Singh, A., 2019. Big data: Hadoop framework vulnerabili-
ties, security issues and attacks. Array, 1, p.100002.
6. Bhosale, H.S. and Gadekar, D.P., 2014. A review paper on big data and ha-
doop. International Journal of Scientific and Research Publications, 4(10),
pp.1-7.
7. Buyya, R., Vecchiola, C. and Selvi, S.T., 2013. Chapter 1 - Introduction. In:
Mastering Cloud Computing. Boston: Morgan Kaufmann, pp.3-27.
8. Cloudlytics. (2021). Hadoop vs Spark: A Comparative Study. [online] Availa-
ble at: https://cloudlytics.com/hadoop-vs-spark-a-comparative-study/.
9. El Yazidi, A., Azizi, M.S., Benlachmi, Y. and Hasnaoui, M.L., 2021. Apache
Hadoop-MapReduce on YARN framework latency. Procedia Computer Sci-
ence, 184, pp.803-808.
10. GeeksforGeeks. (2020). Difference Between Hadoop and Spark. [online]
Available at: https://www.geeksforgeeks.org/difference-between-hadoop-and-
spark/.
11. Geroski, T., Jakovljević, D. and Filipović, N., 2023. Big Data in multiscale
modelling: from medical image processing to personalized models. Journal of
Big Data, 10(1), p.72.
12. Greeshma, L. and Pradeepini, G., 2016. Big data analytics with apache ha-
doop mapreduce framework. Indian Journal of Science and Technology.
13. Gunarathne, T., Wu, T.L., Qiu, J. and Fox, G., 2010, June. Cloud computing
paradigms for pleasingly parallel biomedical applications. In Proceedings of
the 19th ACM International Symposium on High Performance Distributed
Computing (pp. 460-469).
14. Gurusamy, V., Kannan, S. and Nandhini, K., 2017. The real time big data pro-
cessing framework: Advantages and limitations. International Journal of Com-
puter Sciences and Engineering, 5(12), pp.305-312.
15. Hannan, S.A., 2016. An overview on big data and hadoop. International Jour-
nal of Computer Applications, 154(10).
16. Kitchin, R., 2014. Big Data, new epistemologies and paradigm shifts. Big data
& society, 1(1), p.2053951714528481.
17. Macrometa. (n.d.). Apache Spark vs Hadoop - A detailed technical compari-
son. [online] Available at: https://www.macrometa.com/event-stream-pro-
cessing/apache-spark-vs-hadoop.
18. Markonis, D., Schaer, R., Eggel, I., Müller, H. and Depeursinge, A., 2012,
September. Using MapReduce for large-scale medical image analysis. In
2012 IEEE Second International Conference on Healthcare Informatics, Imag-
ing and Systems Biology (pp. 1-1). IEEE.
19. Mohammadpoor, M. and Torabi, F., 2020. Big Data analytics in oil and gas in-
dustry: An emerging trend. Petroleum, 6(4), pp.321-328.
20. Nandimath, J., Banerjee, E., Patil, A., Kakade, P., Vaidya, S. and Chaturvedi,
D., 2013, August. Big data analysis using Apache Hadoop. In 2013 IEEE 14th
International Conference on Information Reuse & Integration (IRI) (pp. 700-
703). IEEE.
21. Olson, M., 2010. Hadoop: Scalable, flexible data storage and analysis. IQT
Quart, 1(3), pp.14-18.
22. Polato, I., Ré, R., Goldman, A. and Kon, F., 2014. A comprehensive view of
Hadoop research—A systematic literature review. Journal of Network and
Computer Applications, 46, pp.1-25.
23. Rahman Rhythm, E., Ahmed Shuvo, R., Kabir Mehedi, M.H., Hossain, M.S.
and Alim Rasel, A. (2022). Distributed Computing for Big Data Analytics:Chal-
lenges and Opportunities. [online] researchgate. Available at: https://www.re-
searchgate.net/publication/366466213_Distributed_Compu-
ting_for_Big_Data_Analytics_Challenges_and_Opportunities?chan-
nel=doi&linkId=64841f02d702370600e655f0&showFulltext=true [Accessed 26
Mar. 2024].
24. Raj, P., 2018. The Hadoop ecosystem technologies and tools. In Advances in
computers (Vol. 109, pp. 279-320). Elsevier.
25. Rajeh, W., 2022. Hadoop distributed file system security challenges and ex-
amination of unauthorized access issue. Journal of Information Security,
13(2), pp.23-42.
26. Swarna, C. and Ansari, Z., 2017. Apache Pig-a data flow framework based on
Hadoop Map Reduce. International Journal of Engineering Trends and Tech-
nology (IJETT), 50(5), pp.271-275.
27. Taylor, R.C., 2010. An overview of the Hadoop/MapReduce/HBase framework
and its current applications in bioinformatics. BMC bioinformatics, 11, pp.1-6.
28. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony,
S., Liu, H. and Murthy, R., 2010, March. Hive-a petabyte scale data ware-
house using hadoop. In 2010 IEEE 26th international conference on data en-
gineering (ICDE 2010) (pp. 996-1005). IEEE.