Big Data Framework

Abstract— We are constantly being told that we live in the Information Era – the Age of Big Data. It is clearly apparent that organizations need to employ data-driven decision making to gain competitive advantage. Processing, integrating and interacting with more data should make it better data, providing both more panoramic and more granular views to aid strategic decision making. This is made possible via Big Data exploiting affordable and usable computational and storage resources. Many offerings are based on the Map-Reduce and Hadoop paradigms, and most focus solely on the analytical side. Nonetheless, in many respects it remains unclear what Big Data actually is; current offerings appear as isolated silos that are difficult to integrate and/or make it difficult to better utilize existing data and systems. This paper addresses this lacuna by characterising the facets of Big Data and proposing a framework in which Big Data applications can be developed. The framework consists of three Stages and seven Layers that divide a Big Data application into modular blocks. The aim is to enable organizations to better manage and architect a very large Big Data application to gain competitive advantage, by allowing management to have a better handle on data processing.

Keywords: Big Data, data scientist, analytics, business intelligence, information management, strategy, Hadoop

I. INTRODUCTION

We live in the Information Era – the Age of Big Data [1][2]².

As an example, Big Data's significance and power became apparent when the results of the 2012 US Presidential Election were announced. Complex analytics processing large data not only predicted the exact election results but may also have influenced them [3][4]. Further, leading business magazines and economic newspapers run frequent articles about Big Data's successes [5][6].

However, it should be recognised that Big Data is not something new; it has long been the playground of the elite. The aim then was to maximise expensive CPU utilisation. As a result, it had a limited audience, as computation and storage were expensive and difficult to utilise, requiring detailed systems knowledge, even though capacity doubles every 18 months [7].

In the Big Data era, computation and storage are cheap per TB. Therefore, with ever-growing computational capabilities, system utilisation is no longer as critical a factor. It is now feasible to use more computational power to do the same work (hence with lower utilisation). At the same time, the amount of data that needs processing has been increasing exponentially in the past decade as a result of improvements in data generation and storage capacity [1]. Above all, programming tools and methodologies have matured with globalisation and the Internet. It is increasingly feasible to reuse code (and also share it); therefore, the focus has moved to integrating code created by different communities.

High-performance network capacity, which provides the backbone for high-end computing systems, has not increased at the same rate as processing and storage capabilities. Therefore, the constraint in computation has simply shifted from moving data to a big supercomputer, to moving the application to many smaller computers where the data resides (function shipping rather than data shipping). Programming such an approach is not new: the application is executed where the data is kept, in a loosely coupled and highly distributed architecture [8].

In contrast, Relational Database Management Systems (RDBMS) tend to provide access to data as one Big Data silo based on efficient, closely coupled systems. Structured Query Language (SQL) is the de facto method to access databases, as it provides relatively easy access to data at different levels within organisations. It is common to see low-level programmers and high-level business analysts sharing the same piece of SQL and understanding, or trying to understand, it.

This sharing model has its limitations and cannot exploit and handle the massive increase in static, non-changing data. Recently, there has been an increase in NoSQL approaches to overcome these weaknesses [9]. Despite their relatively recent emergence, there are now more than one hundred NoSQL approaches that specialize in the management of different multi-modal data types (from structured to non-structured), with the aim of solving very specific challenges. Most are powered by the Map-Reduce paradigm that came from Google, which is based on a massively distributed architecture that exploits cheap commodity hardware. As a result, the need for an efficient mechanism for storing and processing data is eliminated. It is in fact cheaper to duplicate (for reliability) and to over-compute (process duplicate data), as communication is relatively more expensive than storage and computational resources (and this gap is increasing).

¹ Firat Tekiner has an honorary research fellowship at the School of Computer Science, University of Manchester.
² We bemusedly note that "big" has to an extent replaced "very large" from a previous generation – both remain undefined in any quantitative sense and seem to mean "whatever data amount challenges the state-of-the-art".
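The function-shipping, Map-Reduce style of processing described above can be illustrated with a minimal single-process sketch. The helper names (`map_fn`, `reduce_fn`, `map_reduce`) are ours and not part of any Hadoop API; in a real deployment the map and reduce phases run in parallel on the nodes that hold the data.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map phase: emit (word, 1) for every word; runs where the data lives."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: aggregate all counts emitted for one key."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map each input split independently -- these calls are parallelisable.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle: sort and group the intermediate pairs by key before reducing.
    intermediate.sort(key=itemgetter(0))
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

print(map_reduce(["big data", "big big frameworks"]))
```

The shuffle step (sort and group by key) is what lets each reduce call see every value for one key, regardless of which map task emitted it; this is also why duplicated, over-computed intermediate data is tolerable in the paradigm.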
challenges in HPC to achieve fault tolerance and availability. Therefore, it paves the way for the development of highly parallel, highly reliable and distributed applications on large datasets.

III. BIG DATA FRAMEWORK

A. Big Data Characteristics

Big Data is not only driven by the exponential growth of data but also by changing user behaviour and globalization. Much more time is being spent online and using mobile devices. Furthermore, globalization of the marketplace increases competition. As a result, organizations constantly look for opportunities to increase their competitive advantage in an increasingly competitive marketplace by using better analytical models. Hence, it is necessary to present findings in a more clear and concise form. In turn, there has been a commensurate increase in business intelligence applications that allow better reporting and visualization of the data.

To derive the framework, we firstly define the characteristics of "Big Data":

1. Data/Processing Volume and Scale
2. Variety and Heterogeneity of Data/Sources
3. Speed and Timeliness of Information Requirement
4. Targeted Services, Products, Solutions and Applications
5. Data Presentation, Usability and Interpretation
6. Data Privacy, Error Handling and Security

Data volume has been increasing exponentially: up to 2.5 exabytes of data are already generated and stored every day, and this is expected to double within 40 months, by 2015 [20]. As always, there remains the challenge of processing such volume.

The variety and heterogeneity of data sources and storage have increased, fuelled by the use of cloud, web and online computing. The challenge then becomes identification of the data that will add value, and hence increase information content and competitive advantage [21]. Clearly, currency of information is crucial, as analytics derived from new data are usually more valuable than those from old.

For example, consider an online business which analyses every click on its website. An advert or offering is made based on the user's movements and activities whilst browsing and shopping online. However, the adverts can be better targeted and the customers better segmented if customer profiles can be updated and integrated in sub-second intervals [22]. There are different patterns to the data and it is presented in different shapes. For example, management requires reporting and statistical analytics to be made available based on new data in order to be able to respond rapidly to changing requirements [21]. As these analytics provide predictive insight, the resulting decisions are both more robust and timely.

However, a shortage of skills and immature tools make it a daunting task for organizations to present and interpret this newly discovered information and capability. Current hierarchical management models introduce difficulties in dynamic development and adaptation in an ever-changing marketplace. Therefore, organizations that can respond and employ talent to understand, analyze, process and manage this information life cycle will lead the way. This is what has generated the interest, and hype, associated with Big Data. As a caveat, while integration of data provides many advantages, a significant associated risk is data privacy and ownership of the data. This is usually omitted or not understood at this stage (and sometimes later), as the rush is to gain competitive advantage.

The paper argues that a Big Data application is the orchestration of all the software and hardware systems within the enterprise that generate and process data. It means something different for each person, application or organization. For example, an input can be even a click whilst browsing a web page, or a heartbeat packet sent over the network to signal that a system is still up and running. Up to now it was infeasible to store or process information at this level. However, the information era is changing this.

B. Framework

There are formal approaches to project management that provide a methodology to manage Information Systems and drive strategy in organizations; to name a few, Open Group's TOGAF, IBM's Zachman and Gartner's methodology [23]. However, these are not designed to provide a framework with a data focus; rather, their aim is to provide methodologies to manage large information systems. In addition, [24] proposes a framework that looks into Big Data governance with the aim of managing people and policies. In contrast, having identified the characteristics of Big Data, this paper aims to define a framework that captures all the stages of a Big Data application from a strategic point of view, focusing on data. Although [25] provides a Big Data methodology with a data focus, it does not take into consideration the systems aspect of a Big Data environment. Furthermore, there is a need to bridge strategic decision making and real-life scenarios. This paper aims to fill this space.

Without a coordination and structuring framework there is likely to be much overlap amongst applications, duplication in stored information and confusion around the responsibilities of each business unit and application. The framework here seeks to document the borders of each modular block, to allow gaps to be spotted in a Big Data application and to provide solutions through closer integration. Further, it aims to highlight how Map-Reduce can be included in the different stages and layers of the Big Data application life cycle. Whilst doing this, all surrounding issues and approaches are considered.

The framework should ultimately provide a basis to develop and manage Big Data applications whilst identifying strategies based on core competencies and weaknesses. In addition to the 7 layers identified in Figure 2, the process as a whole can be summarized in 3 main stages as below:

STAGE 1: Multiple Data Sources – Choose the Right Data [26]
STAGE 2: Data Analysis and Modelling
STAGE 3: Data Organization and Interpretation
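The three stages can be sketched as a toy pipeline. The click-stream records and function names below are purely illustrative assumptions of ours, not part of the framework itself:

```python
# Hypothetical raw feed mixing user behaviour with system traffic.
raw_sources = [
    {"user": "u1", "event": "click", "page": "/offers"},
    {"user": "u1", "event": "heartbeat"},          # system packet, not behaviour
    {"user": "u2", "event": "click", "page": "/offers"},
]

def stage1_acquire(records):
    """Stage 1: choose the right data -- filter and attach provenance metadata."""
    return [dict(r, source="weblog") for r in records if r["event"] == "click"]

def stage2_model(records):
    """Stage 2: analysis and modelling -- a trivial per-page popularity count."""
    counts = {}
    for r in records:
        counts[r["page"]] = counts.get(r["page"], 0) + 1
    return counts

def stage3_interpret(counts):
    """Stage 3: organisation and interpretation -- present a concise finding."""
    page, hits = max(counts.items(), key=lambda kv: kv[1])
    return f"most visited page: {page} ({hits} clicks)"

print(stage3_interpret(stage2_model(stage1_acquire(raw_sources))))
```

Stage 1 filters out system traffic (the heartbeat) and tags provenance, Stage 2 derives a pattern that was not explicit in the raw data, and Stage 3 renders it as a concise finding for decision makers.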
Stage 1 is concerned with the acquisition and filtering of data by applying correct metadata and processes. Multiple data sources are integrated and transformed to add meaning to the data. This process is the major source of added value (to data) and allows organizations to gain competitive advantage.

Stage 2 then uses the information prepared in Stage 1 to apply analytics and predictive models to find relationships and patterns that were not initially known. The level of intelligence applied depends on the computational capabilities and skill-set available, together with the business requirements. Big Data uses internal and external datasets from a variety of sources to provide information to aid strategic decision making to gain competitive advantage. It allows focus on the current and the future rather than the traditional historical reality. Whilst doing this, it further requires cross-functional collaboration at both the business and technical levels (data sources and systems) [27].

Stage 3 then deals with modelling the source information and mapping the data to the target model whilst interpreting the meaning of the newly discovered information. The relational data model does not naturally accommodate the unstructured and heterogeneous data sources that are expected to be available in Big Data applications [28]. Therefore, there may not be an upfront model whilst organizing the source data to the target. As a result, a large number of applications have emerged that focus on providing access to these data sources via NoSQL, without using SQL. They attempt to create indexing schemes similar to RDBMS and provide quick access to data residing in the Hadoop file system [29][30].

Presentation and visualisation of data is an important task. The NoSQL option changes the dynamics in terms of accessing and presenting the data. With increasing data to be analysed and processed, the output needs to address both clarity and precision of presentation. In addition, interpretation of results is a major challenge that requires highly skilled staff.

The processing stages described map onto the 7 layers of the framework. Each application may focus on different layers and may not employ all parts of it. A Big Data application then becomes a major orchestrating effort whereby a large number of moving parts need to be composed to work seamlessly to achieve results that enable competitive advantage.
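The absence of an upfront model can be sketched as follows. The stored JSON lines and the query function are hypothetical, chosen only to show a schema applied at read time in the NoSQL style rather than enforced at load time:

```python
import json

# Heterogeneous records stored as-is, with no upfront (schema-on-write) model.
stored = [
    '{"customer": "c1", "order_total": 120.0}',
    '{"customer": "c2", "tweet": "great service"}',   # unstructured-style source
    '{"customer": "c1", "order_total": 80.0, "channel": "mobile"}',
]

def totals_by_customer(lines):
    """Apply a model only at read time: pick out the fields this query needs."""
    totals = {}
    for line in lines:
        record = json.loads(line)
        if "order_total" in record:       # skip records lacking the modelled field
            key = record["customer"]
            totals[key] = totals.get(key, 0.0) + record["order_total"]
    return totals

print(totals_by_customer(stored))
```

Records that lack the queried field are simply skipped rather than rejected at load time, which is how heterogeneous sources can be stored as-is and organized toward a target model only when a question is asked of them.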
From a detailed perspective, how the Map-Reduce tasks will be applied depends on the application. As each map and reduce process can run in parallel, both can be used to speed up processing. Furthermore, at any given time, a number of Big Data applications can run at different layers or at different stages.

An important challenge is to bring together and map the relational database model with columnar and key-value stores and unstructured data. For example, banks are experts about their customers; such information may be multiplied in value if joined together with unstructured sources [5]. In addition, Business Intelligence and reporting applications requiring aggregations on a certain field are best served by a DBMS that employs columnar storage. Given that Business Intelligence traditionally uses RDBMS accessed via tools based on SQL, a change is needed.

Whilst the framework looks at and across all dimensions of the problem, almost all current Big Data approaches are silo-based, without coherent linkage or integration. The modelling and mapping layer aims to provide this with respect to data. For example, how system and storage resources could be shared amongst the different applications that would come under the Big Data framework is not a primary aspect of the design. This resource scheduling and maintenance needs to be managed at the system layer. In terms of storage, the challenge is bigger due to recent improvements in the medium; hybrid solid state, optical and hybrid disks operate at various speeds, in addition to slower archiving systems. Separately, the data layer is expected to manage different data sources, handle data lineage and eliminate the duplication that would otherwise be inevitable in an island of applications. This gap will grow further as storage struggles to cope with data growth and the Hadoop File System (HDFS) provides a cheaper yet reliable and performant alternative to current storage systems [31]. In addition, data privacy and security [32] aspects could be more easily managed within this layer under one common enterprise-wide policy.

It has to be noted that there are components which cannot be divided, such as SQL and DBMS, or DBMS and hardware. For instance, modern DBMS make explicit use of, and manage, the underlying hardware. As a result there will also be overlaps and islands in an organisation; hence the need for the framework and for abstraction based on the different layers.

IV. CONCLUDING REMARKS

Processing larger datasets has become increasingly possible over the past few years for a much larger community, not least via the development of the Map-Reduce paradigm. Map-Reduce enables the power of parallel computing to be available to standard data analysis tasks³.

As mentioned before, the main challenge in applying many islands of Big Data applications is to identify the defining lines of each application and their inter-relationships, in order to manage and integrate this spread within an organisation. To clarify what Big Data really is: it is the enterprise data processing environment for heterogeneous data and computational sources, operating in a timely manner, to gain competitive advantage. This results in the processing of high volumes of data and the presentation of this in a concise and clear manner to aid operational and strategic decision making.

Due to the immaturity of the field, there is little or no coordination across Big Data silos (applications). Big Data not only requires in-depth systems and data expertise but also requires strategic insight, due to the nature of the applications [11]. Such applications are evolving very quickly and are designed to aid strategic decision making by responding rapidly to changes in the marketplace. Organisations that are very hierarchical and bureaucratic may initially struggle to compete with the new economy companies [33]. This is evident from the fact that the companies that successfully apply Big Data applications are the likes of Google, Amazon, Yahoo and Facebook, the leading new economy companies.

There is a lack of the multi-faceted role skills (analyst, developer, architect and management) required to orchestrate such applications [16]. The framework proposed here aims to document and structure this gap and provide a starting point for practitioners, analysts and management to develop and exploit their Big Data applications.

The framework can be seen as a cube corresponding to the levels. Each face represents an important level within the Big Data space, while the cube as a whole represents the entire Big Data space and the integrated whole we believe to be essential for effective deployment and evolution of the associated applications. To achieve this requires efficient and effective orchestration, integration and coordination of skills that address the challenges both within and across all seven levels defined in the previous section. This then needs to be further complemented by novel management and decision-making strategies [34][35]. Thus there is a need for more technical managers and decision makers, and the lack of people with analytic skills is likely to be a challenge.

Big Data intersects with numerous domains including data integration, hardware and software, databases, Business Intelligence, system integrators and consulting firms [36]. The associated skill set is vast, and this is one of the challenges and confusions surrounding Big Data applications [15]. The associated scale is daunting and well indicates the need for integration to achieve effective and efficient use of Big Data [37]. Organizations need to be singularly focused whilst providing and employing Big Data solutions, as management of all elements of the framework is challenging. Furthermore, there are many, and growing, numbers of applications that aim to use this improved "knowledge". When all these aspects are considered, the combined issue is technical management and people management. Traditional management does not understand, and cannot be expected to understand, what can be achieved and what the related challenges are. Hence, the framework is proposed to aid the decision-making process and bridge the gap between business needs and technical realities.

³ It should be noted that Map-Reduce has weaknesses in that it is not, by design, general-purpose, but rather was designed for something very specific: keyword processing and access.
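The earlier point that aggregations over a single field are best served by columnar storage can be made concrete with a small sketch; the table layout and field names here are invented for illustration, not drawn from the paper:

```python
# Row layout: each record is stored together, so an aggregate over one field
# still has to walk every whole row.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]
row_total = sum(r["amount"] for r in rows)

# Columnar layout: one contiguous array per field. The same aggregate reads
# only the "amount" column -- the locality that makes column stores suit
# Business Intelligence workloads.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 25.0, 5.0],
}
col_total = sum(columns["amount"])

assert row_total == col_total
print(col_total)
```

Both layouts hold identical data; the difference is purely which bytes an aggregation must touch, which is why the choice of store belongs in the modelling and mapping layer rather than with the query author.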
REFERENCES

[1] The Economist, Nov 2011, "Drowning in numbers – Digital data will flood the planet and help us understand it better", [Link]
[2] Lohr S., Feb 11, 2012, "The Age of Big Data", New York Times, [Link]
[3] Lynch M., Nov 13, 2012, "Barack Obama's Big Data won the US election", Computerworld, [Link]
[4] Fanning K. and Grant R., July/August 2013, "Big Data: Implications for Financial Managers", Wiley Journal of Corporate Accounting & Finance, 24(5):23-30.
[5] The Economist, 19 May 2012, "Big data – Crunching the numbers", [Link]
[6] Taylor P., 26 June 2013, "Big data in the spotlight as never before", Financial Times.
[7] French R. M., December 2012, "Moving beyond the Turing test", Communications of the ACM, 55(12):74-77.
[8] Tekiner F., Tsuruoka Y., Tsujii J., Ananiadou S., Keane J., "Parallel Text Mining for Large Text Processing", pp. 348-353, in Proceedings of IEEE CSNDSP 2010, 21-23 July, Newcastle, UK.
[9] Bonnet L., Laurent A., Sala M., Laurent B., Sicard N., September 2011, "Reduce, You Say: What NoSQL Can Do for Data Aggregation and BI in Large Repositories", pp. 483-488, 22nd International Workshop on Database and Expert Systems Applications (DEXA), 2011.
[10] Merrill R. D., June 2011, "Storage Economics – Four Principles for Reducing Total Cost of Ownership", Hitachi Data Systems Corporation White Paper.
[11] McGuire T., Manyika J., Chui M., July/August 2012, "Why Big Data is the New Competitive Advantage", Ivey Business Journal, [Link]
[12] Beyer M., June 27, 2011, "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data", Gartner, [Link]
[13] Johnson E. J., July/August 2012, "Big Data + Big Analytics = Big Opportunity", Journal of Financial Executive, pp. 1-4.
[14] Nichols W., March 2013, "Advertising Analytics 2.0", Harvard Business Review, 91(3):60-68.
[15] Stonebraker M. and Hong J., February 2012, "Researchers' Big Data Crisis; Understanding Design and Functionality", Communications of the ACM, 55(2):10-11.
[16] Davenport T. H. and Patil D. J., October 2012, "Data Scientist: The Sexiest Job of the 21st Century", Harvard Business Review, 90(10):70-76.
[17] White T., May 2012, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, ISBN: 978-1-449-31152-0.
[18] Saecker M. and Markl V., 2013, "Big Data Analytics on Modern Hardware Architectures: A Technology Survey", Springer Lecture Notes in Business Information Processing, Volume 138, pp. 125-149.
[19] McCreadie R., Macdonald C., Ounis I., 2012, "MapReduce indexing strategies: Studying scalability and efficiency", Journal of Information Processing and Management, 48(5):873-888, ISSN 0306-4573.
[20] Manyika J., Chui M., Brown B., Bughin J., Dobbs R., Roxburgh C., Byers A. H., May 2011, "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, [Link]
[21] Allen B., Bresnahan J., Childers L., Foster I., Kandaswamy G., Kettimuthu R., Kordas J., Link M., Martin S., Pickett K., Tuecke S., February 2012, "Software as a Service for Data Scientists", Communications of the ACM, 55(2):81-88.
[22] Smith S., Mar 04, 2013, "Is Data the New Media?", EContent Magazine, March 2013 Issue:14-19.
[23] Sessions R., May 2007, "A Comparison of the Top Four Enterprise Architecture Methodologies", Microsoft Developer Network Architecture Center, [Link]
[24] Malik P., May-July 2013, "Governing Big Data: Principles and practices", IBM Journal of Research and Development, 57(3/4):1-13.
[25] Miller G. H. and Mork P., Jan-Feb 2013, "From Data to Decisions: A Value Chain for Big Data", IT Professional, 15(1):57-59.
[26] Barton D. and Court D., October 2012, "Making Advanced Analytics Work for You", Harvard Business Review, 90(10):78-83.
[27] Goyal M., Hancock M. Q., Hatami H., July-August 2012, "Selling into Micromarkets", Harvard Business Review, 90(7/8):78-86.
[28] Rong C., Lu W., Wang X., Du X., Chen Y., Tung A. K. H., 02 Oct. 2012, "Efficient and Scalable Processing of String Similarity Join", pre-print, IEEE Transactions on Knowledge and Data Engineering, [Link]
[29] Chandrasekar S., Dakshinamurthy R., Seshakumar P. G., Prabavathy B., Babu C., 4-6 Jan. 2013, "A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System", in Computer Communication and Informatics (ICCCI), 2013, pp. 1-8, ISBN: 978-1-4673-2906-4.
[30] Gudmundsson G. P., Amsaleg L., Jonsson B. P., 27-29 June 2012, "Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++", Content-Based Multimedia Indexing (CBMI), pp. 83-88, E-ISBN: 978-1-4673-2369-7.
[31] Saran C., 12-18 February 2013, "Storage struggles to keep up with data growth explosion", Computer Weekly, pp. 17-19.
[32] Craig T. and Ludloff M. E., 2011, "Privacy and Big Data", O'Reilly, ISBN: 978-1-449-30500-0.
[33] McCallum J. S., March/April 2001, "Managing in the new economy: evolution or revolution?", Ivey Business Journal, [Link]
[34] McAfee A. and Brynjolfsson E., October 2012, "Big Data: The Management Revolution", Harvard Business Review, 90(10):60-69.
[35] Rosenbush S. and Totty M., 8 March 2013, "How Big Data Is Changing the Whole Equation for Business", The Wall Street Journal, [Link]
[36] Chen H., Chiang R. H. L., Storey V. C., December 2012, "Business Intelligence and Analytics: From Big Data to Big Impact", MIS Quarterly, 36(4):1165-1188.
[37] Courtney M., December 2012, "Big Data analytics: putting the puzzle together", Engineering and Technology Magazine, 7(12):56-60.