
Abstract

The MapReduce model has become an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used to support cluster computing jobs that require low response time. This paper discusses the main issues of Hadoop and surveys the solutions proposed for them in the papers studied by the author. Hadoop is not an easy environment to manage. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous, and recent Hadoop research has ignored the network delays caused by data movement at run time. Unfortunately, both the homogeneity and the data locality assumptions in Hadoop are optimistic at best and unachievable at worst, which introduces performance problems in virtualized data centers. The survey also covers the analysis of the single point of failure (SPOF) in Hadoop's critical nodes and a metadata replication based solution that enables Hadoop high availability. Heterogeneity can be addressed by a data placement scheme that distributes and stores data across multiple heterogeneous nodes according to their computing capacities. Analysts have noted that using the technology to aggregate and store data from multiple sources can create a whole slew of problems related to access control and ownership, and applications that analyze merged data in a Hadoop environment can produce new datasets that may also need to be protected.
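To make the capacity-aware data placement idea concrete, the following is a minimal sketch, not taken from the surveyed papers, of how input blocks might be divided among heterogeneous nodes in proportion to their measured computing capacities. The node names, capacity values, and the assignBlocks method are hypothetical illustrations, not part of Hadoop's API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch (assumed, not from the surveyed papers): assign data blocks
 * to heterogeneous nodes in proportion to their relative computing capacity.
 */
public class CapacityAwarePlacement {

    public static Map<String, Integer> assignBlocks(Map<String, Double> nodeCapacity,
                                                    int totalBlocks) {
        double capacitySum = nodeCapacity.values().stream()
                .mapToDouble(Double::doubleValue).sum();

        Map<String, Integer> assignment = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Double> e : nodeCapacity.entrySet()) {
            // Each node receives a share of blocks proportional to its capacity.
            int share = (int) Math.floor(totalBlocks * e.getValue() / capacitySum);
            assignment.put(e.getKey(), share);
            assigned += share;
        }

        // Give any remainder left by rounding down to the fastest node.
        String fastest = Collections.max(nodeCapacity.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        assignment.merge(fastest, totalBlocks - assigned, Integer::sum);
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: capacities are relative processing speeds.
        Map<String, Double> capacity = new LinkedHashMap<>();
        capacity.put("nodeA", 4.0);
        capacity.put("nodeB", 2.0);
        capacity.put("nodeC", 1.0);

        // Prints {nodeA=40, nodeB=20, nodeC=10} for 70 blocks.
        System.out.println(assignBlocks(capacity, 70));
    }
}
```

Under this kind of scheme, a faster node holds more of the data it will process locally, which is one way the surveyed work reduces run-time data movement across heterogeneous clusters.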