BIRMINGHAM CITY UNIVERSITY
FACULTY OF COMPUTING ENGINEERING AND THE BUILT ENVIRONMENT
COURSEWORK ASSIGNMENT BRIEF
CMP7203 Big Data Management
Coursework Assignment Brief
Postgraduate
Academic Year 2022
Module Title: Big Data Management
Module Code: CMP7203
Assessment Title: Design and implementation of a big data solution
Assessment Identifier: Coursework CWRK001 Weighting: 100
School: School of Computing and Digital Technology
Module Co-ordinator: Besher Alhalabi
Hand in deadline date: 20th May 2022 - 12 noon
Return of Feedback date and format: 17th June 2022 - 12 noon
Re-assessment hand in deadline date: 29th July 2022 - 12 noon
Support available for students required to submit a re-assessment: Timetabled revision sessions will be arranged for the period immediately preceding the hand in date.
NOTE: At the first assessment attempt, the full range of marks is available. At the re-assessment attempt the mark is capped and the maximum mark that can be achieved is 50%.
Assessment Summary: The student will build a big data ecosystem using the tools and methods discussed during the module. To do this, the student will analyse a data set simulating big data generated from a large number of users who are playing an imaginary game called "Catch the Pink Flamingo". The student needs to achieve the following:
1- Critically evaluate the three big data processing paradigms as discussed during the course.
2- Develop and implement a big data solution that covers the following:
Data Exploration: acquiring, cleaning, exploring and preparing the data for analysis; any relevant data exploration/visualisation tool that fits the purpose may be used.
Machine learning with big data: applying classification and clustering techniques to the proposed dataset.
Graph Analysis: using Neo4j/Gephi to perform graph analytics of the simulated chat data to find ways of improving the game.
3- Discuss and visualise the resulting data insights.
4- Evaluate the role of ethics in data storage and processing.
The report will be submitted as one deliverable in the form of a written report. The standard of academic writing should be excellent. (Maximum: 4000 words, excluding tables, figures and references.)
Report Writing
Big Data:
Big Data is a combination of structured, semi-structured, and unstructured data collected by
organizations that can be mined for insights and used in machine learning projects, predictive modeling,
and other advanced analytics applications.
Systems that process and store big data have become a common component of data management architectures in organizations, combined with tools that support the use of big data analytics. Big Data is often characterized by the three Vs: the large volume of data in many environments, the wide variety of data types frequently stored in big data systems, and the velocity at which much of the data is generated, collected and processed.
These characteristics were first identified in 2001 by Doug Laney, then an analyst at the consulting firm Meta Group Inc.; Gartner popularized them further after acquiring Meta Group in 2005. More recently,
several other Vs have been added to different descriptions of big data, including veracity, value, and
variability.
Although big data is not a specific volume of data, big data deployments often involve terabytes,
petabytes, and even exabytes of data created and collected over time.
Businesses use big data in their systems to improve operations, provide better customer service, create
personalized marketing campaigns, and take other actions that ultimately can increase revenue and profit.
Companies that use it effectively hold a potential competitive advantage over those that don't, as they are
able to make faster and more informed business decisions.
Essential Types of Data Analysis Methods
Before diving into the seven essential types of methods, it is important to run through the main categories of analysis quickly. Moving from descriptive analytics through to prescriptive analytics, the complexity and effort of evaluating the data increase, but so does the added value for the business.
a) Descriptive analysis - What happened.
The descriptive analysis method is the starting point of all analytical reflection, and it aims to answer the
question of what happened? It does this by ordering, manipulating and interpreting raw data from various
sources in order to transform it into valuable information for your organization.
Descriptive analysis is essential because it allows us to present our ideas in a meaningful way. While this analysis alone won't predict future outcomes or answer questions such as why something happened, it will leave your data organized and ready for further investigation.
b) Exploratory Analysis - How to explore relationships between data.
As the name suggests, the main purpose of exploratory analysis is to explore. Before this stage, there is still no established notion of the relationships between the data and the variables. Once the data has been studied, exploratory analysis allows you to find connections and generate hypotheses and solutions to specific problems. A typical application area is data mining.
c) Diagnostic analysis - Why it happened.
Analyzing diagnostic data empowers analysts and executives by helping them gain a solid contextual
understanding of why something happened. If you know why something happened as well as how it
happened, you will be able to identify the exact ways to solve the problem or challenge.
Designed to provide direct and actionable answers to specific questions, it is one of the most important research methods, supporting key organizational functions such as retail analytics.
How to collect data for machine learning if you don't have it
The line between those who can play ML and those who can't is drawn by years of information gathering.
Some organizations have been accumulating records for decades with such success that they now need
trucks to move them to the cloud because conventional broadband just isn't wide enough.
For those new to the scene, a lack of data is expected, but luckily there are ways to turn that minus into a
plus.
First, rely on open source datasets to kick-start ML execution. There are mountains of data for machine
learning and some companies (like Google) are willing to give it away. We'll talk about public dataset
opportunities a bit later. While these opportunities exist, the real value usually comes from the golden
nuggets of data collected internally and extracted from your own company's business decisions and
activities.
Second - and unsurprisingly - you now have the opportunity to collect data the right way. Companies that started data collection with paper records and ended up with .xlsx and .csv files are likely to have a harder time preparing data than those with a small but clean, ML-compatible dataset. If you know the tasks that machine learning needs to solve, you can design a data collection mechanism in advance.
What about big data?
It's so buzzy, it seems like the thing everyone should be doing. Aiming for big data from the start is
a good mindset, but big data isn't about petabytes. It all depends on the ability to deal with it correctly.
The larger your data set, the more difficult it becomes to make good use of it and gain insights. Having tons
of wood doesn't necessarily mean you can convert it into a warehouse full of chairs and tables. So, the
general recommendation for beginners is to start small and reduce the complexity of their data.
Articulate the problem early: Knowing what you want to predict will help you decide which data may be more useful to collect. When formulating the problem, do some data exploration and try to think in the categories of classification, clustering, regression, and ranking that we discussed in our white paper on the business application of machine learning. Simply put, these tasks are differentiated as follows:
Classification: You want an algorithm to answer binary yes or no questions (cats or dogs, good or bad,
sheep or goats, you get the idea) or you want to do multi-class classification (grass, trees or bushes; cats,
dogs, or birds, etc.) You also need to label the correct answers, so that an algorithm can learn from them.
Check out our guide on how to approach data labeling in an organization.
Clustering: You want an algorithm to find the grouping rules and the number of groups itself. The main difference from classification tasks is that you don't actually know what the groups are or the principles of their division. For example, this usually happens when you need to segment your customers and tailor a specific approach to each segment based on its qualities.
Regression: You want an algorithm to produce a numeric value. For example, if you spend too much time
finding the right price for your product, as it depends on many factors, regression algorithms can help you
estimate that value.
Ranking: Some machine learning algorithms rank objects according to a number of characteristics. Ranking is actively used to recommend movies on video streaming services or to show products that a customer is likely to purchase based on their previous search and purchase activity.
Chances are your business problem fits within this simple segmentation, and you can start tailoring a dataset accordingly. The rule of thumb at this stage is to avoid overly complicated problems.
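As a quick illustration of how the first three task types differ in code, here is a minimal sketch in Python with scikit-learn on synthetic data; all dataset shapes, feature values and model choices are illustrative assumptions rather than part of the coursework dataset.

    # Minimal sketch of the task types above, on synthetic data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # 200 observations, 3 features

    # Classification: labels are known; the model learns a yes/no decision rule.
    y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = LogisticRegression().fit(X, y_class)

    # Clustering: no labels at all; the algorithm proposes the groups itself.
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Regression: the target is a numeric value, e.g. a price estimate.
    y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
    reg = LinearRegression().fit(X, y_reg)

    # Ranking is usually handled with specialised learning-to-rank methods and is omitted here.
    print(clf.predict(X[:5]), clusters[:5], reg.predict(X[:5]))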
Establish data collection mechanisms:
Creating a data-driven culture in an organization is perhaps the hardest part of the whole initiative. We
touched on this briefly in our Machine Learning Strategy article. If you plan to use ML for predictive
analytics, the first thing to do is to combat data fragmentation.
For example, if you look at travel technology - one of Alternate's core areas of expertise - data
fragmentation is one of the key analytics issues here. In hotel businesses, departments in charge of
physical property go into quite intimate detail about their customers. Hotels know guests' credit card
numbers, the types of amenities they choose, sometimes home addresses, room service usage, and even
drinks and meals ordered during a stay. However, the website people book these rooms on may treat them
like complete strangers.
This data is siloed in different departments and even in different tracking points within a department. Marketers may have access to a CRM, but the customers there are not associated with web analytics. It's not always possible to converge all data streams into centralized storage if you have many engagement, acquisition, and retention channels, but in most cases it's manageable.
Usually, data collection is the job of a data engineer, a specialist responsible for creating data
infrastructures. But in the early stages, you can hire a software engineer who has some database
experience.
Human factor management:
Another point here is the human factor. Collecting data can be a tedious task that burdens your employees and overwhelms them with instructions. If people have to constantly and manually make records, they're likely to view these tasks as just another bureaucratic quirk and let the work slip. For
example, Salesforce provides a decent set of tools to track and analyze seller activity, but manual data
entry and activity logging alienate sellers.
This problem can be solved with the help of robotic process automation systems. RPA algorithms are
simple rule-based robots that can perform tedious and repetitive tasks.
Preparing your dataset for machine learning: core techniques that improve your data
Articulate the problem early.
Establish data collection mechanisms.
Check the quality of your data.
Format the data to make it consistent.
Reduce the data.
Complete data cleaning.
Create new features from existing ones.
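A brief sketch of a few of these steps in Python with pandas; the file name and column names are invented for illustration and are not part of the coursework dataset.

    import pandas as pd

    # Illustrative file and column names; adapt to your own dataset.
    df = pd.read_csv("raw_records.csv")

    # Check the quality of the data: missing values and duplicates.
    print(df.isna().sum())
    df = df.drop_duplicates()

    # Format the data to make it consistent.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["country"] = df["country"].str.strip().str.upper()

    # Reduce the data: keep only the columns relevant to the prediction task.
    df = df[["user_id", "signup_date", "country", "sessions", "purchases"]]

    # Clean the data: drop rows that are unusable after formatting.
    df = df.dropna(subset=["signup_date"])

    # Create new features from existing ones.
    df["purchases_per_session"] = df["purchases"] / df["sessions"].clip(lower=1)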
Types of Machine Learning Algorithms:
There are some variations in defining the types of machine learning algorithms, but they can generally be
divided into categories based on their purpose and the main categories are:
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Supervised learning
I like to think of supervised learning in terms of function approximation: we train an algorithm and, at the end of the process, we choose the function that best describes the input data, the one that, for a given X, makes the best estimate of y (X -> y). Most of the time we are not able to determine the true function that always makes the correct predictions. Another reason is that the algorithm relies on assumptions made by humans about how the computer should learn, and these assumptions introduce bias; bias is a subject I will explain in another post.
Here the human experts act as teachers: we feed the computer training data containing the inputs/predictors, we show it the correct answers (outputs), and from this data the computer should be able to learn patterns.
Supervised learning algorithms attempt to model the relationships and dependencies between the target prediction output and the input features so that we can predict output values for new data based on the relationships learned from previous datasets.
List of common algorithms
Nearest neighbor
Naive Bayes
Decision trees
Linear regression
Support Vector Machines (SVM)
Neural networks
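A minimal supervised-learning sketch using one of the algorithms above (a decision tree) on a standard labelled dataset; the dataset choice is purely illustrative.

    # Supervised learning: the algorithm is shown inputs and the correct answers,
    # then approximates the X -> y mapping for unseen data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("Accuracy on unseen data:", model.score(X_test, y_test))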
Unsupervised learning
The computer is trained with unlabeled data.
Here there is no teacher at all; in fact, the computer might be able to teach you new things after learning patterns in the data. These algorithms are especially useful in cases where the human expert does not know what to look for in the data.
These are the family of machine learning algorithms mainly used in pattern detection and descriptive modeling. There are no categories or output labels here on which the algorithm can base its model of the relationships. Instead, these algorithms apply techniques to the input data to search for rules, detect patterns, and summarize and cluster data points, which helps in gaining meaningful insights and describing the data better to users.
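A minimal unsupervised sketch, assuming Python with scikit-learn and synthetic data: no labels are given, and KMeans proposes the groups itself.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    X, _ = make_blobs(n_samples=300, centers=4, random_state=7)   # labels discarded
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=7).fit(X)

    print("Cluster sizes:", [list(kmeans.labels_).count(c) for c in range(4)])
    print("Cluster centres:\n", kmeans.cluster_centers_)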
Semi-supervised learning
In the two previous types, either labels are present for all observations in the dataset or there are no labels at all. Semi-supervised learning falls between the two. In many practical situations the cost of labeling is quite high, since it requires skilled human experts. So, when labels are absent for the majority of observations but present for a few, semi-supervised algorithms are the best candidates for building the model. These methods exploit the idea that, even though the group memberships of the unlabeled data are unknown, these data still carry important information about the group parameters.
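A minimal semi-supervised sketch, assuming Python with scikit-learn: most labels are hidden (marked -1) and a label-spreading model propagates the few known labels to the unlabeled points.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)

    y_partial = np.copy(y)
    unlabeled = rng.random(len(y)) < 0.9        # hide ~90% of the labels
    y_partial[unlabeled] = -1                   # -1 means "label unknown"

    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
    print("Accuracy on the originally hidden labels:",
          (model.transduction_[unlabeled] == y[unlabeled]).mean())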
Reinforcement learning
The method aims to use observations gathered from interaction with the environment to take actions that
would maximize reward or minimize risk. The reinforcement learning algorithm (called the agent)
continuously learns from the environment in an iterative way. In the process, the agent learns from its
experiences of the environment until it explores the full range of possible states.
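A minimal reinforcement-learning sketch: a tabular Q-learning agent interacting with a tiny, invented corridor environment; the environment and hyperparameters are illustrative assumptions.

    import numpy as np

    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.2

    def step(state, action):
        # Moving right eventually reaches a rewarding terminal state.
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        return next_state, reward, next_state == n_states - 1

    rng = np.random.default_rng(0)
    for _ in range(500):                  # episodes of interaction with the environment
        state, done = 0, False
        while not done:
            action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate towards reward + discounted future value.
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    print("Learned greedy policy (1 = move right):", Q.argmax(axis=1))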
Analyse big data problems using scalable machine learning algorithms on Spark:
Hadoop implemented MapReduce and designed a distributed file system called HDFS, where a single large file is split and stored on the disks of multiple computers. The idea was to split huge datasets across many individual machines, each with its own CPU, RAM and hard drive, interconnected by a fast LAN.
Spark was good for both:
i) Data-intensive tasks, as it uses HDFS, and
ii) Computationally heavy tasks, as it uses RAM instead of disk to store intermediate outputs (e.g. iterative solutions).
Since Spark could use RAM, it became an efficient solution for iterative machine learning tasks like Stochastic Gradient Descent (SGD). This is why Spark MLlib has become so popular for machine learning, unlike Hadoop's Mahout. Also, to do distributed deep learning with TensorFlow, you can use multiple GPUs in the same enclosure, or multiple GPUs in different enclosures (a GPU cluster). While today's supercomputers use GPU clusters for compute-intensive tasks, you can install Spark on such a cluster to make it suitable for tasks like distributed deep learning, which are both compute- and data-intensive.
Introduction to Hadoop and Spark.
Mainly, there are two components in Hadoop:
Hadoop Distributed File System (HDFS): A fault-tolerant distributed file system used by Hadoop and Spark. HDFS allows a large file to be split into 'n' pieces and kept on 'n' nodes. When the file is accessed, the different blocks of data must be reachable across the nodes via the local network.
Map-Reduce: Given a task on a huge amount of data, spread over many nodes, many data transfers must
take place and processing must be distributed. Let's examine this in detail.
To implement a function in Hadoop you just need to write the Map and Reduce functions. Note that there is disk I/O between each Map-Reduce operation in Hadoop. However, almost all ML algorithms work iteratively. Each iteration step of SGD [equation below] corresponds to a Map-Reduce operation. After each iteration step, the intermediate weights are written to disk, which can take up to 90% of the total convergence time.
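For reference, the standard SGD weight update at iteration t has the form

    w_{t+1} = w_t - \eta \nabla L(w_t)

where \eta is the learning rate and \nabla L(w_t) is the gradient of the loss computed on the sampled data. In Hadoop, each such update is one Map-Reduce pass, with the new weights w_{t+1} written to disk.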
Spark, which became an Apache project in 2013, replaced these disk I/O operations with in-memory operations. With the help of Mesos – a distributed systems kernel – Spark caches the intermediate dataset after each iteration. Because the output of each iteration is stored in an RDD, only one disk read and one disk write are required to complete all iterations of SGD. Spark is built around the Resilient Distributed Dataset (RDD), an immutable, fault-tolerant collection of distributed data stored in main memory. In addition to RDDs, the DataFrame API is designed to abstract away this complexity and facilitate machine learning on Spark.
This is how any optimization-based ML algorithm can be distributed; for a comparison of Hadoop vs Spark performance, see a distributed LR implementation.
Spark RDDs allow multiple map operations to be performed in memory, so there is no need to write intermediate datasets to disk, which makes it up to 100 times faster. Note that the time taken for the first iteration is
almost the same, since both Hadoop and Spark need to read from disk. But in subsequent iterations, Spark's in-memory read takes only 6 seconds, compared with Hadoop's 127-second disk read.
Also, an ML scientist does not need to code the Map and Reduce functions: most ML algorithms are available in Spark MLlib, and all data preprocessing can be done using Spark SQL.
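A brief sketch of that workflow, assuming PySpark; the file name, column names, and model choice are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

    # Hypothetical input: one row per game session with a churn label.
    sessions = spark.read.csv("sessions.csv", header=True, inferSchema=True)
    sessions.createOrReplaceTempView("sessions")

    # Data preprocessing with Spark SQL: aggregate raw sessions per user.
    features = spark.sql("""
        SELECT user_id,
               AVG(session_length) AS avg_session_length,
               SUM(purchases)      AS total_purchases,
               MAX(churned)        AS churned
        FROM sessions
        GROUP BY user_id
    """)

    # Model training with Spark MLlib: no hand-written Map/Reduce code is needed.
    assembler = VectorAssembler(
        inputCols=["avg_session_length", "total_purchases"], outputCol="features")
    model = LogisticRegression(featuresCol="features", labelCol="churned") \
        .fit(assembler.transform(features))

    print(model.coefficients)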
Task 1: Estimate the value of Pi (π)
Take a unit circle centred at the origin and consider the unit square [0, 1] x [0, 1], which encloses one quarter of the circle.
Area of the unit square = 1
Since it is a unit circle, the area of the full circle = π
Area of the quarter circle = π/4
Thus, π = 4 * area of the quarter circle
The area of the quarter circle can be computed using,
1. Numerical Methods: using integration
2. Monte Carlo Approach: to find answers using random sampling
In the Monte Carlo approach:
Draw (x, y) points uniformly at random from 0 to 1 (i.e. inside the unit square).
Area of the quarter circle ≈ the fraction of points that fall within the circle.
E.g.: out of 1000 random points, if 'k' points fall within the circle, then the area of the shaded region ≈ k/1000.
These operations are trivially parallelizable, as there is no dependency across nodes when checking whether a point falls within the circle. The PySpark code below, once run on a local Spark setup, will output a value closer to π = 3.14 as we increase the number of random points (NUM_SAMPLES).
The random function will generate a number between 0 and 1.
The "inside" function runs a million times and returns "True" only when the random point is inside the
circle.
sc.parallelize() will create an RDD split into k=10 partitions.
The filter will apply the passed function.
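A minimal PySpark sketch matching the description above (it follows the well-known Spark Pi example; NUM_SAMPLES, the "inside" function and the k=10 partitions come from the description, the rest is a reconstruction):

    import random
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "EstimatePi")

    NUM_SAMPLES = 1_000_000   # increase this to get an estimate closer to 3.14

    def inside(_):
        # Draw a random (x, y) point in the unit square and return True
        # only when it falls inside the quarter circle of radius 1.
        x, y = random.random(), random.random()
        return x * x + y * y < 1

    # Create an RDD split into k=10 partitions, keep the points inside the
    # circle with filter(), and count them.
    count = sc.parallelize(range(NUM_SAMPLES), 10).filter(inside).count()

    print("Pi is roughly", 4.0 * count / NUM_SAMPLES)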
Time Series Prediction Using Random Forest
Let's solve an ML problem in Standalone, Spark Local & Cluster mode.
Problem statement: The daily temperature, wind, precipitation, and humidity of a location are recorded from 1990 to 2020. Given these features, create a time series model to predict humidity in 2021. To validate the model, hold out the 2020 Q4 humidity values and compare the predictions against them using a suitable metric.
The full source code for the experiments below can be found here.
Data: sample rows of the daily weather dataset (table omitted here).
Output: predicted humidity for the 2020 Q4 hold-out period (omitted here).
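One possible sketch of this experiment with Spark MLlib; the file name, schema, and lag-feature construction are assumptions, and the linked source code above remains the reference implementation.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("HumidityForecast").getOrCreate()

    # Hypothetical file and schema: date, temperature, wind, precipitation, humidity.
    df = spark.read.csv("weather_1990_2020.csv", header=True, inferSchema=True)

    # Use the previous day's humidity as a lag feature alongside the other measurements.
    w = Window.orderBy("date")
    df = df.withColumn("humidity_lag1", F.lag("humidity", 1).over(w)).dropna()

    assembler = VectorAssembler(
        inputCols=["temperature", "wind", "precipitation", "humidity_lag1"],
        outputCol="features")
    data = assembler.transform(df)

    # Hold out 2020 Q4 for validation, as the brief suggests.
    train = data.filter(F.col("date") < "2020-10-01")
    test = data.filter(F.col("date") >= "2020-10-01")

    rf = RandomForestRegressor(featuresCol="features", labelCol="humidity", numTrees=100)
    model = rf.fit(train)

    predictions = model.transform(test)
    rmse = RegressionEvaluator(labelCol="humidity", predictionCol="prediction",
                               metricName="rmse").evaluate(predictions)
    print(f"RMSE on 2020 Q4 hold-out: {rmse:.3f}")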
Spark is a fast, general-purpose engine for large-scale data processing. The "fast" part means that it is faster than previous approaches to working with Big Data, such as classic MapReduce. The secret to this speed is that Spark runs in memory (RAM), which makes processing much faster than working from hard drives.
Structured data typically resides in relational databases (RDBMS). Fields store length-delimited data like
phone numbers, social security numbers, or zip codes. Records even contain variable-length text strings
like names, making searching easier. Data can be human or machine generated, as long as it is created in
an RDBMS structure. This format is eminently searchable, both with human-generated queries and through
algorithms using data types and field names, such as alphabetical or numeric, currency or date.
Common relational database applications with structured data include airline reservation systems,
inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) is used to
perform queries on this type of structured data in relational databases.
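For instance, a query over such structured records might look like the following minimal Python sketch using the built-in sqlite3 module; the table and column names are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE reservations (
                        booking_id INTEGER PRIMARY KEY,
                        passenger  TEXT,
                        flight_no  TEXT,
                        fare       REAL,
                        booked_on  DATE)""")
    conn.execute("INSERT INTO reservations VALUES (1, 'A. Smith', 'BA123', 129.99, '2022-04-01')")

    # SQL lets us query structured fields by type and name (text, currency, date...).
    for row in conn.execute(
            "SELECT flight_no, COUNT(*) AS seats, SUM(fare) AS revenue "
            "FROM reservations GROUP BY flight_no"):
        print(row)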
Some relational databases store or point to unstructured data, such as customer relationship management
(CRM) applications. Integration can be tricky at best because memo fields don't lend themselves to
traditional database queries. Yet most CRM data is structured.
A Graph Database (GDB) is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data.[1] A key concept of the system is the graph (or edge, or relationship).
The graph connects the data items in the store into a collection of nodes and edges, the edges representing the relationships between the nodes. Relationships allow data in the store to be linked directly and, in many cases, retrieved in a single operation. Graph databases treat the relationships between data as a first-class concern. Querying relationships is fast because they are persistently stored in the database. Relationships can be visualized intuitively using graph databases, making them useful for highly interconnected data.
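As a small illustration of querying a graph database from Python (relevant to the Neo4j graph-analysis task in this brief), the sketch below assumes an invented chat-data schema and local connection details; adapt both to your own setup.

    from neo4j import GraphDatabase

    # Hypothetical connection details and property names.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Example: find the users who chat with the most distinct partners, assuming
    # messages were loaded as (:User)-[:SENT]->(:Message)-[:TO]->(:User).
    QUERY = """
    MATCH (a:User)-[:SENT]->(:Message)-[:TO]->(b:User)
    RETURN a.id AS user, count(DISTINCT b) AS partners
    ORDER BY partners DESC
    LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(QUERY):
            print(record["user"], record["partners"])

    driver.close()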
Doing Ethics in the Age of Big Data Research: this work draws on a variety of theoretical paradigms, from fine-grained conceptual distinctions to ongoing discussions of ethics in data-driven research. In the second decade of the 21st century, a grand narrative is emerging that posits that knowledge derived from data analysis is true because of the objective qualities of the data, the means of its collection and analysis, and the sheer size of the data set. A by-product of this grand narrative is that the qualitative aspects of the behavior and experience that form the data are diminished, and the human is removed from the analysis process. This situates data science as an analysis process performed by the tool, which obscures the human decisions made in the process. The researchers involved in this special issue problematize assumptions and trends in big data research and highlight the crisis of accountability that emerges from the use of such data to carry out societal interventions. Its contributors offer a range of answers to the question of how to configure ethics through a methodological framework in the context of the prevalence of big data, neural networks, and the automated, algorithmic governance of much of society.
Big Data refers to extremely large datasets intended for computational analysis that can be used to advance research by revealing patterns and associations. Innovative research that takes advantage of Big Data can significantly advance the fields of medicine and public health, but it can also raise new ethical challenges. This article explores these challenges and how they might be addressed in such a way that individuals are
optimally protected. Key ethical concerns raised by big data research include respecting patient autonomy
through the provision of adequate consent, ensuring fairness, and respecting participant privacy. Examples
of actions that could be taken to address these key concerns at a broader regulatory level, as well as at a
specific case level, are presented. Big data research offers enormous potential, but due to its widespread
influence, it also introduces the potential for massive harm. It is imperative to review and consider the risks
associated with this research.
Big Data solutions help detect customer sentiment about products or services of an organization and gain a
deeper, visual understanding of the multichannel customer journey and then act on these insights to
improve the customer experience.
Most big data architectures include some or all of the following components:
Data Sources: All Big Data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed file store which can
hold large volumes of large files in various formats. This type of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing: Because the datasets are so large, a big data solution must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using custom Hive, Pig, or Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion: If the solution includes real-time sources, the architecture must include a
means to capture and store real-time messages for stream processing. It can be a simple data store, where
incoming messages are dropped into a folder for processing. However, many solutions require a message
ingestion store to act as a buffer for messages and to support scalable processing, reliable delivery, and
other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.
Stream processing: After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
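A minimal Spark Structured Streaming sketch in Python illustrating this filter-aggregate-sink pattern; the socket source and word-count aggregation are illustrative assumptions (in practice the source would typically be Kafka or Event Hubs).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("StreamWordCounts").getOrCreate()

    # Hypothetical source: a socket stream on localhost:9999.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Aggregate the unbounded stream before writing it to an output sink (the console here).
    counts = (lines
              .select(F.explode(F.split(lines.value, " ")).alias("word"))
              .groupBy("word")
              .count())

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())

    query.awaitTermination()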
Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data
in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also
be used to serve data for analysis.
Analysis and reporting: The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data
modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It
might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI
or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as
Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data
exploration, you can use Microsoft R Server, either standalone or with Spark.
Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.