
Abstract

The MapReduce model has become an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used to support cluster computing jobs that require low response time. This paper discusses the main issues of Hadoop and surveys the solutions proposed for them in the papers studied by the author. Hadoop is not an easy environment to manage. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous, and recent Hadoop research has ignored the network delays caused by data movement at run time. Unfortunately, both the homogeneity and the data locality assumptions in Hadoop are optimistic at best and unachievable at worst, which introduces performance problems in virtualized data centers. The survey also covers the analysis of the single point of failure (SPOF) in Hadoop's critical nodes and a metadata replication based solution that enables Hadoop high availability. Heterogeneity can be addressed by a data placement scheme that distributes and stores data across multiple heterogeneous nodes according to their computing capacities. Analysts have noted that using the technology to aggregate and store data from multiple sources can create a whole slew of problems related to access control and ownership, and applications that analyze merged data in a Hadoop environment can produce new datasets that may also need to be protected.
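To make the capacity-aware data placement idea concrete, the following is a minimal sketch, not taken from the surveyed papers, of how input blocks might be divided among heterogeneous nodes in proportion to their measured computing capacities. The node names, capacity values, and the assignBlocks method are hypothetical illustrations, not part of Hadoop's API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch (assumed, not from the surveyed papers): assign data blocks
 * to heterogeneous nodes in proportion to their relative computing capacity.
 */
public class CapacityAwarePlacement {

    public static Map<String, Integer> assignBlocks(Map<String, Double> nodeCapacity,
                                                    int totalBlocks) {
        double capacitySum = nodeCapacity.values().stream()
                .mapToDouble(Double::doubleValue).sum();

        Map<String, Integer> assignment = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Double> e : nodeCapacity.entrySet()) {
            // Each node receives a share of blocks proportional to its capacity.
            int share = (int) Math.floor(totalBlocks * e.getValue() / capacitySum);
            assignment.put(e.getKey(), share);
            assigned += share;
        }

        // Give any remainder left by rounding down to the fastest node.
        String fastest = Collections.max(nodeCapacity.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        assignment.merge(fastest, totalBlocks - assigned, Integer::sum);
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: capacities are relative processing speeds.
        Map<String, Double> capacity = new LinkedHashMap<>();
        capacity.put("nodeA", 4.0);
        capacity.put("nodeB", 2.0);
        capacity.put("nodeC", 1.0);

        // Prints {nodeA=40, nodeB=20, nodeC=10} for 70 blocks.
        System.out.println(assignBlocks(capacity, 70));
    }
}
```

Under this kind of scheme, a faster node holds more of the data it will process locally, which is one way the surveyed work reduces run-time data movement across heterogeneous clusters.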