A Study of Heuristically-based Parametric Performance Optimization
Algorithms for Big Data Computing
Jongyeop Kim
Department of Mathematics and Computer Science, Southern Arkansas University
100 E University, Magnolia, AR 71753, USA
[email protected]

Abstract. Performance optimization for MapReduce computing on the Hadoop platform is a tedious yet challenging problem due to the complexity of the system organization and the extensive list of configuration parameters to be considered. To address this problem, this research proposes several parameter optimization algorithms, ranging from a naive exhaustive method to a random method and a couple of heuristically-based greedy methods, in order to cope with the exponential nature of the search for the best possible parameter setting. In exercising these algorithms, a few variables must be taken into consideration to make each algorithm a viable option, such as the arity of each parameter, which determines the base of the search time, and the sampling method, which keeps the complexity of the search process under control or within budget.
Keywords: Big Data, Hadoop, Configuration, Performance tuning.
1. Introduction
This research concerns the performance optimization of computations running on top of a big data platform, Apache Hadoop in particular. However, the methods and algorithms proposed in this research are expected to be applicable to other platforms as well without loss of generality. The diagram below [1] shows the computation flow on Hadoop, which primarily consists of one name node and multiple slave nodes.
A MapReduce process [2] is composed of two primary functions: 1) the Map function takes a set of input key/value pairs and produces a set of intermediate output key/value pairs; 2) the Reduce function takes an intermediate key2 and the set of values for that key, as illustrated in the following.
Map(Key1, Value1) -> List(Key2, Value2)
Reduce(Key2, List(Value2)) -> List(Value2)
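The Map/Reduce contract above can be sketched as a minimal word-count example. This is an illustrative sketch, not the paper's implementation: the framework normally performs the shuffle between phases, which is simulated inline here.

```python
from collections import defaultdict

def map_fn(key1, value1):
    """Map(Key1, Value1) -> List(Key2, Value2): emit a (word, 1) pair per word."""
    return [(word, 1) for word in value1.split()]

def reduce_fn(key2, values2):
    """Reduce(Key2, List(Value2)) -> List(Value2): sum the counts for one key."""
    return [sum(values2)]

def run_job(documents):
    # Map phase: apply map_fn to every (doc_id, text) record.
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for k2, v2 in map_fn(doc_id, text):
            intermediate[k2].append(v2)  # shuffle: group values by key
    # Reduce phase: apply reduce_fn once per intermediate key.
    return {k2: reduce_fn(k2, vals)[0] for k2, vals in intermediate.items()}

counts = run_job({"d1": "big data big compute"})
```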
2. Naive Exhaustive Method
A naïve exhaustive method [7] has been proposed as an initial attempt at the problem of this research, and its pros and cons are summarized in the following table. The pro of this approach is that it is simple and exhaustive. The cons are that, as more parameters are considered, the benchmark time grows exponentially, and that the method is incapable of further optimization due to the limited range of parameter values. The cost of this approach is asymptotically a^p, where a is the number of values per parameter under consideration and p is the number of parameters. Thus, the cost of the method is upper bounded by exponential time.
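The a^p cost is easy to see in a sketch of the exhaustive search. The parameter names, value lists, and the benchmark stub below are hypothetical stand-ins for the real Hadoop configuration parameters and TeraSort runs.

```python
import itertools

# Illustrative parameter space: a = 3 values for each of p = 3 parameters.
PARAM_VALUES = {
    "io.sort.mb": [50, 100, 200],
    "io.sort.factor": [10, 50, 100],
    "mapred.reduce.tasks": [4, 8, 16],
}

def benchmark(setting):
    # Placeholder cost model; a real run would launch the benchmark job
    # with `setting` applied and return its measured CPU time.
    return sum(abs(v - 60) for v in setting.values())

def exhaustive_search(param_values):
    """Benchmark all a^p combinations and keep the best-performing setting."""
    names = list(param_values)
    best_setting, best_time = None, float("inf")
    for combo in itertools.product(*(param_values[n] for n in names)):
        setting = dict(zip(names, combo))
        t = benchmark(setting)
        if t < best_time:
            best_setting, best_time = setting, t
    return best_setting, best_time

best, _ = exhaustive_search(PARAM_VALUES)  # evaluates 3^3 = 27 settings
```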
3. 2-Phased Exhaustive Sorting
Building on the naïve exhaustive method, a 2-Phased Exhaustive Sorting-based algorithm has been developed and implemented [7]. The algorithm consists of two phases of sorting: the first phase is performed to obtain an extended list (e.g., the top 50 parameter settings) of top-performing parameter settings via a rough sorting against a small data size; the second phase then performs a detailed sorting against a large data size to produce a short list (e.g., the top 5 out of the above 50) of top-performing parameter settings. The flow of the approach [8] is shown in the figure below.
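The two phases above can be sketched as follows. The candidate settings and both benchmark functions are stubs: the small-data benchmark is cheap but noisy, the large-data benchmark expensive but accurate, which is the trade-off the 2-phase design exploits.

```python
import random

random.seed(0)
CANDIDATES = list(range(200))  # stand-ins for candidate parameter settings

def bench_small(c):
    # Cheap, rough measurement on a small dataset (noisy placeholder).
    return (c * 7919) % 1000 + random.uniform(-5, 5)

def bench_large(c):
    # Expensive, accurate measurement on a large dataset (placeholder).
    return (c * 7919) % 1000

def two_phase(candidates, wide=50, narrow=5):
    # Phase 1: rough sort of all candidates against small data; keep top 50.
    shortlist = sorted(candidates, key=bench_small)[:wide]
    # Phase 2: detailed sort of the shortlist against large data; keep top 5.
    return sorted(shortlist, key=bench_large)[:narrow]

top5 = two_phase(CANDIDATES)
```

Only the shortlisted candidates ever pay the large-data benchmark cost.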
4. Random Method
The random method [8] performs r rounds of random sampling, and each round samples a number of values from the given range, which is supposed to be wider than the a values used in the naïve exhaustive method. The following is pseudocode for the main routine implementing the proposed random algorithm. Each of the p parameters samples one value from its range during each of the r rounds of selection.
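A minimal sketch of such a main routine is given below. The parameter names, ranges, and the benchmark stub are illustrative assumptions, not the paper's actual configuration; the structure (r rounds, one sampled value per parameter per round) follows the description above.

```python
import random

# Wider value ranges than the exhaustive method's fixed a values.
PARAM_RANGES = {
    "io.sort.mb": range(50, 401),
    "io.sort.factor": range(10, 201),
    "mapred.reduce.tasks": range(1, 33),
}

def benchmark(setting):
    # Placeholder; a real implementation would run the benchmark job.
    return sum(setting.values())

def random_search(param_ranges, rounds):
    """r rounds; each round samples one value per parameter and benchmarks."""
    best_setting, best_time = None, float("inf")
    for _ in range(rounds):
        setting = {p: random.choice(list(r)) for p, r in param_ranges.items()}
        t = benchmark(setting)
        if t < best_time:
            best_setting, best_time = setting, t
    return best_setting, best_time

best, _ = random_search(PARAM_RANGES, rounds=20)
```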
5. Heuristically-based Greedy Method
The flow diagram of the proposed heuristically-based algorithm is shown below. During the parameter generation step (B), p parameters are selected. With those parameters, the TeraSort benchmark program is executed; during the (C) Get results function, the results are collected and saved into the Omega space Ω(i). During (D), the best-performing parameter setting is selected and fixed in a greedy manner. This iterates for p rounds.
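The greedy loop above can be sketched as follows. Parameter names, candidate values, defaults, and the benchmark stub are hypothetical; the structure (p rounds, per-round results collected into an Omega space, best value fixed each round) mirrors steps (B)-(D).

```python
# Illustrative candidate values and defaults for p = 3 parameters.
PARAM_VALUES = {
    "io.sort.mb": [50, 100, 200],
    "io.sort.factor": [10, 50, 100],
    "mapred.reduce.tasks": [4, 8, 16],
}
DEFAULTS = {"io.sort.mb": 100, "io.sort.factor": 10, "mapred.reduce.tasks": 4}

def benchmark(setting):
    # Placeholder cost; a real run would execute the TeraSort benchmark.
    return sum(abs(v - 60) for v in setting.values())

def greedy_tune(param_values, defaults):
    setting = dict(defaults)
    for param, values in param_values.items():   # p rounds total
        omega = {}                               # results space for this round
        for v in values:                         # a benchmark runs per round
            omega[v] = benchmark({**setting, param: v})
        setting[param] = min(omega, key=omega.get)  # fix the best value, greedily
    return setting

tuned = greedy_tune(PARAM_VALUES, DEFAULTS)
```

Each round performs a benchmark runs and fixes one parameter, giving roughly a * p runs in total rather than a^p.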
6. Experimental Results
Experimental Results Ω(0)
• i.s.r.p decreased by 20% and fixed
• CPU time reduced by 17.68%
• Cost: a * p (a = 18, p = 7)
Experimental Results Ω(1)
• m.j.s.m.p decreased by 40% and fixed
• CPU time reduced by 16.73%
• Cost: a * (p - 1) (a = 18, p = 7)
Ω(1) %   ParIdx  P(0)  P(1)  P(2)  P(3)  P(4)  P(5)  P(6)  CPU    diff
+40      4       100   0.04  0.8   10    0.42  0.66  0.7   40140  1290.71
-40      5       100   0.04  0.8   10    0.1   0.48  0.7   32350  -6499.3
+40      5       100   0.04  0.8   10    0.1   0.64  0.7   41220  2370.71
Ω(1)          -90%    -80%    -70%    -60%    -50%    -40%    -30%    -20%    -10%
i.s.m          2.86    6.8     3.4     6.44    6.82    6.31    8.57  -13.64    5.54
i.s.s.p      -13.87    6.31    8.68    4.25    2.53    7.7     5.69    6.95  -14.28
i.s.f        -12.28    6.93    8.34    0.93    5.15    4.76    1.14    6.62    5.79
m.j.r.i.b.f    3.99    6.28    2.6     2.32    6.31    4.66    2.06    6.62    9.4
m.j.s.m.p      7.21  -14.26    5.92    6.8     7.6   -16.73    4.48    3.04   -9.39
m.j.s.i.b.p    5.77    2.91    8.09  -12.56    6.64  -10.35    6.98  -10.4     6.8

Ω(1)           10%     20%     30%     40%     50%     60%     70%     80%     90%
i.s.m          4.4     9.11  -13.2     5.54    3.94    4.71  -14.67    7.67    8.24
i.s.s.p        6.72  -16.68    2.96    5.25    9.19  -11.58    8.75  -16.47    3.61
i.s.f        -12.76    6.08    2.06    6.64    4.07    5.85  -12.82    7.13  -10.42
m.j.r.i.b.f    5.54    6.77  -13.05    3.32    6.54    5.56    8.37    3.5     5.31
m.j.s.m.p      7.34  -10.42    2.78    6.1     7.47    6.8   -11.86  -13.07    6.15
m.j.s.i.b.p    3.4     5.82    8.73    2.04    7.42    5.92    5.38    7.11  -10.71
7. Conclusion
This research has proposed various parametric performance optimization algorithms to achieve the best possible performance in MapReduce computing on a Hadoop platform. There are three representative algorithms, whose pros and cons are concluded as follows:
- Naive Exhaustive: O(a^p); the best possible performance, yet infeasibly expensive
- Random: O(1 * p * r) ≈ O(p^2); moderate performance, feasible and inexpensive
- Heuristically-based: O(p^3); great performance, feasible and reasonably inexpensive
The naive exhaustive parametric optimization takes O(a^p) time; the resulting performance of the program is supposed to be the best, but the benchmarking is infeasible and very expensive. Thus, it is not considered a viable solution.
The random parametric optimization takes approximately quadratic time and is considered the least expensive approach. However, the resulting performance of the benchmark is limited to roughly average.
Lastly, the proposed heuristically-based parametric optimization takes slightly more than quadratic time, or at most cubic time, and the resulting performance is also promising with a feasible optimization time. The proposed heuristic employs a greedy approach, so the resulting performance is great rather than optimal, since it does not guarantee the optimal performance.
References
[1] "Hadoop Distributed File System Architecture." http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png. Web. 20 Jan. 2015.
[2] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.
[3] "The Fresh Open Source Software Archive." http://fossies.org/linux/misc/hadoop-1.2.1.tar.gz/hadoop-1.2.1/docs/mapred-default.html. Web. 20 Jan. 2015.
[4] Noll, Michael G. "Benchmarking and stress testing an Hadoop cluster with TeraSort, TestDFSIO & co." http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench (2011).
[5] Hadoop 1.2.1 Documentation, mapred-default configuration parameters. https://hadoop.apache.org/docs/r1.2.1/mapred-default.htm. Web. 20 Jan. 2015.
[6] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012, pp. 166-168.
[7] Kim, Jongyeop, et al. "Performance evaluation and tuning for MapReduce computing in Hadoop distributed file system." Industrial Informatics (INDIN), 2015 IEEE 13th International Conference on. IEEE, 2015.
[8] Kim, Jongyeop, and Nohpill Park. "Identification of the Optimal Hadoop Configuration Parameter Sets for MapReduce Computing." 3rd International Conference on Applied Computing & Information Technology, July 2015.