Big Data Assessment-1
Big Data Analysis Using Apache Hadoop
1. Introduction
The amount of data created every day is rising dramatically. Very large data sets,
often measured in zettabytes, are typically referred to as "big data". Government
authorities, businesses, and other organisations work to gather and preserve data
about their citizens and clients in an effort to better understand them and forecast
future behaviour. Social networking sites such as Facebook and Twitter produce new
information every second, and managing this kind of data is one of the biggest
challenges that businesses face. Data stored in data warehouses is in a raw state and
requires proper processing and analysis before it can deliver valuable information,
which creates further difficulties. To process vast volumes of data within brief
timeframes, various new tools are being implemented. The Java-based Apache Hadoop
programming platform can be used in networked computing systems to process large
data sets.
Hadoop is applied in systems with numerous nodes that are capable of managing
terabytes of data. Hadoop uses its own distributed file system, HDFS, to provide fast
data transfer that tolerates node failures and avoids overall system failure. Hadoop
uses the MapReduce model, which divides massive amounts of information into smaller
units and performs various processing steps on them in parallel.
In practice, several technologies work together to perform this task: REST web
services for communication, the Spring for Apache Hadoop framework for basic
functionality and MapReduce job execution, Apache Maven for building the distributed
code, and Apache Hadoop itself for distributed dataset processing.
2. Apache Hadoop
Apache Hadoop is a Java-based software platform that facilitates massive data
processing on low-cost hardware clusters with distributed storage. Its computing
power and data storage capacity can grow from a single server to hundreds of
servers. As an open-source project, Hadoop is typically deployed for large-scale data
applications such as data mining, machine learning, and real-time data processing.
Its capacity to handle huge amounts of data reliably and efficiently makes it an ideal
choice for businesses working with large datasets (Rahman Rhythm et al., 2022b).
The open-source Hadoop framework is published under the Apache License and has been
applied to data processing, storage, and management in various big data applications
that run on clustered computers. Big data was originally characterised by the "3Vs",
but the model has since been extended to the "5Vs", which are described below as the
defining characteristics of big data.
(Source: https://azuretar.com/big-data-101-with-apache-spark-and-python/)
Volume:
Big data refers to very large amounts of data. Previously, data was generated either
by users or by professionals; today, the scale of data that needs to be analysed is
enormous because it is produced and processed for many kinds of functions by
machines, networks, and human interaction on channels like social media. Where most
of an organisation's data once came from its own employees, data is now gathered from
clients, partners, and employees as well as from Twitter, Facebook, Instagram,
YouTube and other websites. Consider the immense amount of data created every day: a
single file might range from a few kilobytes to more than a gigabyte in size, and
thousands of such files taken together reflect the volume dimension of big data.
Variety:
Variety refers to the many sources and types of data.
Organised Information: Data that has been arranged into a prepared
repository, usually a database, is known as organised information or
structured data. Structured data implies a database in which the data is
organised so that it is easily accessible. An RDBMS, where data is arranged
in tables with rows and columns, is a prime illustration.
Unorganised Data: Textual and non-textual data with no predefined structure
is considered unorganised or unstructured data. Examples of this type of data
include photographs, Portable Document Format (PDF) files, audio and video
files, and many other kinds of data sourced from social media platforms.
Websites such as Facebook, Twitter, YouTube and WhatsApp allow users to
submit data in the form of likes, posts, comments, and uploaded pictures. All
of this data is classified as unstructured because it cannot be stored in the
form of rows and columns.
Partially Organised Data: Partially organised, or semi-structured, data does
not fit a rigid schema but carries metadata that describes its structure.
Although not necessarily ordered, this metadata can still be stored in
databases. HTML is a good example: it consists of tags, each of which has a
parent tag and a child tag, and we label this type of data as semi-structured.
Velocity:
In big data, velocity refers to the speed at which information flows into the system
from various sources, including corporate procedures, machinery, networks, and
human engagement with items such as social media platforms and mobile devices. The
flow of data is massive and continuous, and this real-time data helps researchers and
corporations make important decisions.
Velocity also covers the analysis of data in streaming form, that is, how swiftly
data is processed. The New York Stock Exchange, the largest stock market in the
world, generates, captures, and processes around one terabyte of data every trading
session; one can only imagine how rapidly such data needs to be processed. This is
what the velocity of streaming data analysis refers to.
Veracity:
In big data, the term "veracity" highlights the noise, biases, and abnormalities in
the data being mined and stored, and whether that data is meaningful for the issue
being studied. Compared with factors such as volume and velocity, veracity presents
the greatest hurdle when analysing data.
Value:
Value is the most crucial of the big data characteristics. Having access to big data
is great, but it is unusable unless we can turn it into something valuable.
Implementing IT infrastructure systems to hold large volumes of data is increasingly
expensive, and businesses need to see a return on that investment.
2.1 Components of Hadoop
Hadoop consists of three primary components.
1. HDFS: Hadoop Distributed File System (HDFS) divides large datasets into
smaller blocks, distributing them across nodes in a cluster for enhanced
scalability (Bhosale, Devendra and Gadekar, 2014; Miguel, Caballé and
Xhafa, 2015). Specifically designed for deployment on commodity clusters,
HDFS ensures reliability through the replication of file data, achieving fault
tolerance by duplicating data across multiple nodes. This replication strategy
guarantees data availability, even in the event of node failures (Gunarathne et
al., 2010). The MapReduce programming model facilitates parallel
processing by breaking down complex tasks into smaller, independent sub-
tasks, processed in parallel across the nodes (Bhosale, Devendra and
Gadekar, 2014; Miguel, Caballé and Xhafa, 2015).
2. Map Reduce: MapReduce is the heart of Hadoop, and the concept behind it is
simple to understand. In practical terms, the name "MapReduce" indicates two
separate and distinct tasks that a Hadoop program carries out. The map job is
the initial process; it receives the input data and uses it to generate
key/value pairs. These pairs are then sorted and shuffled before being passed
as input to the reduce task. The reduce task gathers those key/value pairs and
aggregates or combines them to produce the result. As the name MapReduce
suggests, the reduce job is always completed after the map job; a minimal
word-count sketch of this model is shown after this list.
3. Yarn: Yet Another Resource Negotiator (YARN) is responsible for resource
management in the Hadoop system and is one of the key components of the
Hadoop ecosystem. Given its importance in workload administration and
tracking, YARN is often referred to as the operating system of Hadoop. It
enables multiple data processing engines to perform batch processing and
real-time streaming on data stored in a single platform.
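The classic word count illustrates how the map and reduce phases fit together. The
following is a minimal, illustrative Java sketch of such a MapReduce job; the class
names and input/output paths are chosen only for this example and are not drawn from
any particular source.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a (word, 1) key/value pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle and sort, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged as a JAR and submitted with a command along
the lines of "hadoop jar wordcount.jar WordCount /input /output", where both paths
refer to HDFS directories.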
There are also other components like Hive, Pig, HBase and many more.
Other Components:
Hive: The open-source data warehousing programme Hive was developed on
top of Hadoop. It was created to enhance Hadoop's query capabilities and
facilitate more effective analysis of big datasets for end users. With Hive,
queries can be expressed in HiveQL, a declarative language that is
comparable to SQL and is compiled into Hadoop map-reduce tasks. It makes
working with data easier for users by introducing well-known ideas like tables,
columns, and partitions to the unstructured world of Hadoop. Metastore is a
system catalogue that comes with Hive. It has statistics and schemas that are
helpful for data exploration, query optimisation, and query compilation.
Additionally, it extends the functionality of the underlying IO libraries to query
data in new formats and enables users to plug bespoke map-reduce scripts into
queries.
Pig: For managing massive volumes of data, Apache Pig is an open-source
project built on top of the Hadoop MapReduce framework. It consists of a
runtime environment that runs on top of Hadoop and Pig Latin, a dataflow
language for specifying data processing operations. Pig has capabilities like
nested data models and rich user-defined functions to accommodate all types
of data: structured, unstructured, and semi-structured. It's commonly used in
businesses like Yahoo, Twitter, AOL, MapQuest, and LinkedIn for a variety of
purposes, such as data mining, log processing, and data querying. Pig has
the benefit of allowing developers to write fewer lines of code, which cuts
down on development and testing time.
HBase: HBase is a distributed, fault-tolerant, and scalable database system
built on top of the Hadoop Distributed File System (HDFS). It provides
real-time read-write access and is akin to Google's BigTable. Data is stored
in tables as rows and columns using a multidimensional sorted map format, and
each cell is uniquely identified by table, row, column family, column, and
timestamp. Along with its Java client API, HBase supports MapReduce jobs.
Through integrations with Hive and HBql it allows SQL-like querying and
additional indices. HBase can scale to petabyte-sized datasets and integrates
easily with a variety of data sources; a brief sketch of the Java client API
is shown below.
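As a rough illustration of the Java client API mentioned above, the sketch below
writes and then reads back a single cell. The table name, column family, and row key
are hypothetical and used only for this example; the snippet assumes an HBase client
configured via hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) { // hypothetical table

            // Write one cell: row key, column family, column qualifier, value.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back in real time by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}

Because the configuration is read from the classpath, the same code runs unchanged
against any cluster the client is configured for.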
(Source: https://data-flair.training/blogs/hadoop-ecosystem-components/)
2.2 Critical Analysis of Strengths and Weaknesses of Hadoop:
2.2.1 Advantages:
1. Flexibility and Scalability: Hadoop scales horizontally by adding more nodes to
the cluster to handle increasing amounts of data. As the data volume grows, more
nodes can be added seamlessly to handle the load, ensuring efficient processing
(Olson, 2010; Kitchin, 2014; Khalandar et al., 2019).
2. Fault Tolerance: Data availability is ensured even in the event of node
failures thanks to HDFS's replication of data across nodes. HDFS also has
mechanisms for detecting and recovering from failures: it uses a combination
of heartbeats and block reports to monitor the health of the cluster and identify
failed nodes. When a node fails, HDFS automatically re-replicates the data
blocks stored on that node to maintain the desired replication factor (Kitchin,
2014; Khalandar et al., 2019; Elkawkagy and Elbeh, 2020). A small illustration of
inspecting replication through the Java FileSystem API follows this list.
3. Cost-Effectiveness: Hadoop operates on commodity hardware, which
consists of affordable, readily available components, thereby slashing
infrastructure expenses (Barrett and Kipper, 2010). Through Hadoop,
organizations can streamline resource allocation, cutting down on manpower
and time requirements and achieving significant cost reductions. The distributed
storage and processing functionalities of Hadoop enable optimal utilization of
hardware resources, mitigating the need for costly hardware upgrades
(Khalandar et al., 2019).
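As a brief illustration of the fault-tolerance point above, the replication factor
and block placement of a file in HDFS can be inspected through the Java FileSystem
API. This is a minimal sketch assuming a configured Hadoop client on the classpath;
the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // hypothetical HDFS path

        // Ask the NameNode to keep three copies of every block of this file.
        fs.setReplication(file, (short) 3);

        // Report where each block is physically stored; if a DataNode fails,
        // HDFS re-replicates its blocks elsewhere to restore this factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor = " + status.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(", ", loc.getHosts()));
        }
        fs.close();
    }
}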
2.2.2 Weaknesses:
1. Latency: Apache Hadoop's latency can be considered a weakness,
particularly in scenarios where low latency is critical, such as in finance and
other real-time applications. The MapReduce framework, which is the core of
Hadoop, is not designed for low-latency processing, making it less suitable for
interactive query processing and real-time data analysis (Yazidi et al., 2021).
2. Complexity: Hadoop has a steep learning curve, and managing clusters can
be complex for some organizations (Gurusamy, Kannan and Nandhini, 2017).
The complexities of Hadoop also include the need for efficient resource
utilisation, input splits, and shuffle operations (Ahmed et al., 2020).
3. Security Vulnerabilities: Research points out security issues within the
MapReduce framework of Apache Hadoop, such as lack of authentication and
unsecured communication between Hadoop daemons (Bhathal and Singh,
2019).
2.3 Hadoop vs Spark:
Feature-by-feature comparison of Hadoop and Spark:
Mode of Processing: Hadoop facilitates batch processing only, whereas Spark supports
both batch and stream processing.
Velocity: Hadoop computes using the hard drive, which makes it slower, whereas Spark
executes calculations in memory, which makes it faster.
Data Caching: Caching of data in memory is not supported by Hadoop but is supported
by Spark.
Protection: Hadoop is seen as being secure, whereas Spark is seen as being less
secure compared to Hadoop.
Cost: Hadoop is less expensive than Spark, since Spark needs more memory.
Languages of Programming: The majority of Hadoop runs on Java, whereas Spark supports
numerous programming languages, including Scala, Java, Python, and R.
2.4 Practical Applications:
2.4.1 Real-World Cases/Examples in Various Sectors:
1. Transportation: Hadoop plays a crucial role in the transportation sector, par-
ticularly in the analysis and optimization of traffic. By processing extensive
volumes of traffic data, it becomes possible to enhance route planning, allevi-
ate congestion, and boost overall transportation efficiency. Within the trans-
portation industry, big data is harnessed for various purposes such as route
optimization, effective fleet management, predictive maintenance, and the en-
hancement of logistics operations, as highlighted by Raj in 2018.
2. Petroleum: The oil and gas sector uses Hadoop for big data analytics in a
range of technological applications. It is employed to manage the massive
volume, diversity, speed, and complexity of data produced by processes such as
drilling, exploration, and production. In particular, Hadoop is used in oil and
gas exploration to process and analyse seismic data (which, according to Law
Insider, is collected by sending energy or sound waves into the earth and
recording the wave reflections to indicate the type, size, shape and depth of
subsurface rock formations) in order to create 2D and 3D representations of
subsurface layers. Furthermore, real-time data transmitted from drilling
instruments such as logging-while-drilling (LWD) and measurement-while-drilling
(MWD) tools is processed and analysed using Hadoop. Hadoop is also used in the
downstream oil and gas sector for activities such as improving petrochemical
asset management, optimising production pump performance, and improving
reservoir characterisation and modelling.
3. HealthCare: Hadoop has proved a powerful technology for large-scale medical
image analysis, where three important processing scenarios have been
established. First, it is used on a Hadoop cluster for support vector machine
(SVM) parameter optimisation in lung texture classification. Using SVM, a
well-known machine learning methodology, the best parameters for classifying
lung (or other) textures are found with this method; with the MapReduce
framework accelerating the search, finding the ideal SVM parameters takes a
total of about 10 hours.
Second, Hadoop is used within the MapReduce architecture for content-based
medical image indexing. With this technique, content-based medical image
indexing can be completed more quickly, showing that MapReduce is an effective
way of indexing large numbers of images.
Finally, solid texture analysis based on three-dimensional wavelets (wave-like
oscillations whose amplitude begins at zero) is parallelised using Hadoop,
which greatly shortens the total runtime while preserving Hadoop streaming
compatibility. This approach shortens the overall execution time by
accelerating wavelet analysis for solid texture categorisation.
2.5 Challenges:
1. Security and Governance: Due to its distributed architecture and its
dependencies on a wide range of technologies, involving memory resources,
processors, operating systems, networks, databases, and communication
protocols, Hadoop presents a substantial security problem (Rajeh, 2022). Any of
these components could have a security issue that puts the system as a whole at
risk. Hadoop's capacity for parallel computation results in a complicated
ecosystem that is highly vulnerable to attack. Because users share physical
resources with one another, data is never entirely under a single user's
authority, and parallelism means that data is stored across numerous machines,
so it would be simple for a malicious actor to share a physical device with a
client. If sufficient security measures are not taken, an adversary may gain
unrestricted access to information and sabotage legitimate users. Compromised
clients can spread harmful data across the network, potentially affecting the
entire system.
2. Performance: To transport data across nodes, Hadoop relies on Remote
Procedure Calls over TCP/IP. Because the default communication is insecure, it
is easy for attackers to alter inter-node communication and compromise the
system. Since computations can be carried out anywhere in the cluster,
precisely locating a single computation is difficult, which makes it hard to
guarantee the security of every computing location. Moreover, insecure
communication can lead to data leakage while data is transferred between a
DataNode and a client. Unwanted nodes are frequently added to systems with the
intention of stealing data or impairing computations.
3. Complexity: Because Hadoop is a complicated system, managing and
maintaining it requires a high level of skill. One of the major issues with
Hadoop reported in the literature is the complexity of its parallelisation and
distribution; new designs, approaches, algorithms, and analytics are required
to manage it and to extract value and hidden knowledge from it.
2.6 Impact of Apache Hadoop on Efficiency and Decision-Making
Apache Hadoop has a transformative impact on efficiency and decision-making
across various industries. Its ability to process large datasets allows organizations to
derive valuable insights for informed decision-making (Buyya, Vecchiola and Selvi,
2013). For instance, in e-commerce, Hadoop is used to analyse customer behaviour
and preferences, enabling businesses to optimize product recommendations and
marketing strategies, ultimately improving overall operational efficiency and decision-
making processes.
Apache Hadoop's technical prowess in handling big data, its parallel and distributed
processing capabilities, and its practical applications across diverse sectors
underscore its significance in the field of big data analytics. While facing challenges,
Hadoop continues to be a driving force in empowering organizations to extract
meaningful insights from large datasets, thereby enhancing decision-making
processes and operational efficiencies.
3 Conclusion:
In summary, Apache Hadoop is a remarkable framework that transforms data
processing and decision-making procedures across a variety of industries. By
empowering organisations to derive important insights from massive datasets, Hadoop
increases operational efficiency and facilitates well-informed decision-making.
Even in the face of obstacles such as security and governance concerns, Hadoop
remains a major player in the big data analytics space. Its significant influence on
efficiency and decision-making makes it a crucial component of the era of
data-driven insights and operational excellence.
References:
1. Ahmed, N., Barczak, A.L., Susnjak, T. and Rashid, M.A., 2020. A comprehen-
sive performance analysis of Apache Hadoop and Apache Spark for large
scale data sets using HiBench. Journal of Big Data, 7(1), p.110.
2. Anuradha, J., 2015. A brief introduction on Big Data 5Vs characteristics and
Hadoop technology. Procedia computer science, 48, pp.319-324.
3. Asri, H., Mousannif, H., Al Moatassime, H. and Noel, T., 2015, June. Big data
in healthcare: Challenges and opportunities. In 2015 International Conference
on Cloud Technologies and Applications (CloudTech) (pp. 1-7). IEEE.
4. Benlachmi, Y., El Yazidi, A. and Hasnaoui, M.L., 2021. A comparative analy-
sis of hadoop and spark frameworks using word count algorithm. International
Journal of Advanced Computer Science and Applications, 12(4), pp.778-788.
5. Bhathal, G.S. and Singh, A., 2019. Big data: Hadoop framework vulnerabili-
ties, security issues and attacks. Array, 1, p.100002.
6. Bhosale, H.S. and Gadekar, D.P., 2014. A review paper on big data and ha-
doop. International Journal of Scientific and Research Publications, 4(10),
pp.1-7.
7. Buyya, R., Vecchiola, C. and Selvi, S.T., 2013. Chapter 1 - Introduction. In:
Mastering Cloud Computing. Boston: Morgan Kaufmann, pp.3-27.
8. Cloudlytics. (2021). Hadoop vs Spark: A Comparative Study. [online] Availa-
ble at: https://cloudlytics.com/hadoop-vs-spark-a-comparative-study/.
9. El Yazidi, A., Azizi, M.S., Benlachmi, Y. and Hasnaoui, M.L., 2021. Apache
Hadoop-MapReduce on YARN framework latency. Procedia Computer Sci-
ence, 184, pp.803-808.
10. GeeksforGeeks. (2020). Difference Between Hadoop and Spark. [online]
Available at: https://www.geeksforgeeks.org/difference-between-hadoop-and-
spark/.
11. Geroski, T., Jakovljević, D. and Filipović, N., 2023. Big Data in multiscale
modelling: from medical image processing to personalized models. Journal of
Big Data, 10(1), p.72.
12. Greeshma, L. and Pradeepini, G., 2016. Big data analytics with apache ha-
doop mapreduce framework. Indian Journal of Science and Technology.
13. Gunarathne, T., Wu, T.L., Qiu, J. and Fox, G., 2010, June. Cloud computing
paradigms for pleasingly parallel biomedical applications. In Proceedings of
the 19th ACM International Symposium on High Performance Distributed
Computing (pp. 460-469).
14. Gurusamy, V., Kannan, S. and Nandhini, K., 2017. The real time big data pro-
cessing framework: Advantages and limitations. International Journal of Com-
puter Sciences and Engineering, 5(12), pp.305-312.
15. Hannan, S.A., 2016. An overview on big data and hadoop. International Jour-
nal of Computer Applications, 154(10).
16. Kitchin, R., 2014. Big Data, new epistemologies and paradigm shifts. Big data
& society, 1(1), p.2053951714528481.
17. Macrometa. (n.d.). Apache Spark vs Hadoop - A detailed technical compari-
son. [online] Available at: https://www.macrometa.com/event-stream-pro-
cessing/apache-spark-vs-hadoop.
18. Markonis, D., Schaer, R., Eggel, I., Müller, H. and Depeursinge, A., 2012,
September. Using MapReduce for large-scale medical image analysis. In
2012 IEEE Second International Conference on Healthcare Informatics, Imag-
ing and Systems Biology (pp. 1-1). IEEE.
19. Mohammadpoor, M. and Torabi, F., 2020. Big Data analytics in oil and gas in-
dustry: An emerging trend. Petroleum, 6(4), pp.321-328.
20. Nandimath, J., Banerjee, E., Patil, A., Kakade, P., Vaidya, S. and Chaturvedi,
D., 2013, August. Big data analysis using Apache Hadoop. In 2013 IEEE 14th
International Conference on Information Reuse & Integration (IRI) (pp. 700-
703). IEEE.
21. Olson, M., 2010. Hadoop: Scalable, flexible data storage and analysis. IQT
Quart, 1(3), pp.14-18.
22. Polato, I., Ré, R., Goldman, A. and Kon, F., 2014. A comprehensive view of
Hadoop research—A systematic literature review. Journal of Network and
Computer Applications, 46, pp.1-25.
23. Rahman Rhythm, E., Ahmed Shuvo, R., Kabir Mehedi, M.H., Hossain, M.S.
and Alim Rasel, A. (2022). Distributed Computing for Big Data Analytics:Chal-
lenges and Opportunities. [online] researchgate. Available at: https://www.re-
searchgate.net/publication/366466213_Distributed_Compu-
ting_for_Big_Data_Analytics_Challenges_and_Opportunities?chan-
nel=doi&linkId=64841f02d702370600e655f0&showFulltext=true [Accessed 26
Mar. 2024].
24. Raj, P., 2018. The Hadoop ecosystem technologies and tools. In Advances in
computers (Vol. 109, pp. 279-320). Elsevier.
25. Rajeh, W., 2022. Hadoop distributed file system security challenges and ex-
amination of unauthorized access issue. Journal of Information Security,
13(2), pp.23-42.
26. Swarna, C. and Ansari, Z., 2017. Apache Pig-a data flow framework based on
Hadoop Map Reduce. International Journal of Engineering Trends and Tech-
nology (IJETT), 50(5), pp.271-275.
27. Taylor, R.C., 2010. An overview of the Hadoop/MapReduce/HBase framework
and its current applications in bioinformatics. BMC bioinformatics, 11, pp.1-6.
28. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony,
S., Liu, H. and Murthy, R., 2010, March. Hive-a petabyte scale data ware-
house using hadoop. In 2010 IEEE 26th international conference on data en-
gineering (ICDE 2010) (pp. 996-1005). IEEE.