Module 2:
Big Data Technologies
Hadoop’s Parallel World, Data Discovery, Open Source Technology for Big Data Analytics,
Cloud and Big Data, Predictive Analytics, Mobile Business Intelligence and Big Data.
What are Big Data Technologies?
Big data technologies are software utilities designed primarily to analyze, process, and
extract information from very large data sets with extremely complex structures. Such data is
very difficult for traditional data processing software to deal with.
Among today's most prominent technology trends, big data technologies are closely associated
with other technologies such as deep learning, machine learning, artificial intelligence
(AI), and the Internet of Things (IoT), all of which are greatly enhanced by large-scale data.
In combination with these technologies, big data technologies focus on analyzing and handling
large amounts of both real-time data and batch data.
Hadoop’s Parallel World:
Hadoop is an open source framework provided by Apache to process and analyze very large
volumes of data. It is written in Java and has been used at scale by companies such as
Facebook, LinkedIn, Yahoo, and Twitter.
Modules of Hadoop:
Apache Hadoop is composed of four core modules that facilitate its functionality for
distributed storage and processing of large datasets:
Hadoop Common:
This module provides the essential utilities and libraries that support the other Hadoop
modules. It contains the necessary Java libraries and scripts required to start Hadoop and is
utilized by HDFS, YARN, and MapReduce.
Hadoop Distributed File System (HDFS):
HDFS is a distributed file system designed to store very large files across multiple
machines in a cluster. It provides high-throughput access to application data and ensures
fault tolerance by replicating data blocks across different nodes.
Hadoop YARN (Yet Another Resource Negotiator):
YARN is the resource management layer of Hadoop. It is responsible for managing
compute resources in clusters and scheduling user applications. YARN separates resource
management from job scheduling, allowing various processing engines beyond MapReduce
to run on Hadoop.
Hadoop MapReduce:
MapReduce is a programming model and processing engine for parallel processing of large
datasets. It provides a framework for developing applications that process vast amounts of
data in a distributed and fault-tolerant manner across a cluster of commodity hardware.
1. HDFS: Hadoop Distributed File System. HDFS was developed on the basis of the Google
File System (GFS) paper published by Google. Files are broken into blocks and stored
on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the
cluster.
3. MapReduce: This is a framework that helps Java programs perform parallel computation
on data using key-value pairs. The Map task takes input data and converts it into a
data set of key-value pairs. The output of the Map task is consumed by the Reduce task,
and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the
other Hadoop modules.
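The Map and Reduce tasks described above can be sketched with a word count. This is a minimal single-process simulation in Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up input: each string stands in for one input split.
counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Each call to map_phase depends only on its own line, which is why the map step can be spread across many machines.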
What is data discovery?
Data discovery enables your organization to identify, catalog, and classify business-critical
and sensitive data, so you can govern it for meaningful purposes with increased transparency.
Data discovery helps you:
1. Uncover new insights for opportunities in business value creation
2. Apply data protection to lower risk exposure from abuse and comply with privacy
mandates
3. Drive similar high-value business outcomes where data is the fuel of modern business
operations.
Data discovery provides the data intelligence an organization needs to develop new products
and services, optimize data use, and protect data from risk exposure. The result is greater
opportunity for new revenue sources as larger volumes of data are collected and discovered
across modern enterprises.
For example, information captured from a company’s consumers, such as personal preferences
and transaction records, may lack the necessary transparency when it is scattered across
enterprise systems. Data discovery uses AI and machine learning to help automate building a
metadata repository, which accelerates the understanding of where data is located and where
it is being moved and used. It also helps determine the data's value to the organization so
that it can be made available through data democratization efforts, such as a data marketplace.
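The idea of identifying, cataloging, and classifying sensitive data can be illustrated with a small sketch. It assumes a rule-based approach using two regular expressions, which is far simpler than the AI and machine learning methods mentioned above. The table names, column names, and patterns are all made up for illustration.

```python
import re

# Hypothetical patterns for illustration only; real discovery tools use
# much richer rules and learned models.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,}\d$"),
}


def classify_column(values):
    """Return the label whose pattern matches every non-empty value."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return "unclassified"
    for label, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in non_empty):
            return label
    return "unclassified"


def build_catalog(tables):
    """Build a tiny metadata repository: (table, column) -> classification."""
    catalog = {}
    for table_name, columns in tables.items():
        for column_name, values in columns.items():
            label = classify_column(values)
            catalog[(table_name, column_name)] = {
                "classification": label,
                "sensitive": label != "unclassified",
            }
    return catalog


# Made-up sample data standing in for columns scattered across systems.
tables = {
    "crm_customers": {
        "contact": ["ana@example.com", "li@example.org"],
        "favorite_color": ["blue", "green"],
    },
    "orders": {
        "callback_number": ["+1 555-010-1234", "555 010 9999"],
    },
}
catalog = build_catalog(tables)
for location, info in catalog.items():
    print(location, info)
```

The resulting catalog records where potentially sensitive data lives, which is the starting point for governing or protecting it.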
Open Source technology for Big Data Analytics
Apache Hadoop is a collection of open-source software utilities that facilitates using a
network of many computers to solve problems involving massive amounts of data and
computation. It provides a software framework for distributed storage and processing of big
data using the MapReduce programming model. Hadoop was originally designed for
computer clusters built from commodity hardware, which is still the common use. It has since
also found use on clusters of higher-end hardware. All the modules in Hadoop are designed
with a fundamental assumption that hardware failures are common occurrences and should be
automatically handled by the framework.
A Brief History of Apache Hadoop:
Apache Hadoop is a free, Java-based software framework for big data analytics. It stores
vast amounts of data effectively across a group of machines known as a cluster. The
framework runs in parallel on the cluster and can process huge volumes of data across all of
its nodes. Hadoop's storage system, the Hadoop Distributed File System (HDFS), splits large
volumes of data and distributes them across the nodes of a cluster. It also replicates data
within the cluster, providing high availability and recovery from failure, which increases
fault tolerance.
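The splitting and replication described above can be sketched as follows. This is a toy model, not HDFS code: the block size and replication factor are tiny illustrative values (recent HDFS releases default to 128 MB blocks and a replication factor of 3), the node names are made up, and the round-robin placement is an assumption that ignores HDFS's rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration; real HDFS also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
still_readable = readable_blocks(placement, failed_node="n2")
print(blocks)
print(placement)
print(still_readable)
```

Because every block lives on two nodes in this example, losing any single node leaves all blocks readable, which is the fault tolerance the paragraph refers to.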
Cloud and Big Data:
Cloud Computing: It is the on-demand delivery of resources such as servers, databases,
networking, software, analytics, applications, and computational power over the Internet to
promote speed and flexibility as well as economies of scale. It helps lower operational
costs and can improve reliability. Vast amounts of computing resources can be provisioned
within minutes or even less.
Big Data Analytics: It is the process of observing complicated patterns and relationships
within large volumes of varied data (big data) and using that analysis to make informed
and effective business decisions. Large data sets are analyzed to draw conclusions about
them.
Below is a table of differences between Cloud Computing and Big Data Analytics:
Data Analytics: It is the process of deducing the logical sets and patterns by filtering and
applying required transformations and models on raw data. The following steps can be
followed to explore the behavioral pattern of data and draw the necessary conclusions.
Widely used tools for data analytics include R, Python, SAS, Tableau Public, KNIME,
Apache Spark, Excel, QlikView, and OpenRefine.
Predictive Analytics:
Predictive analytics involves making predictions about future outcomes by studying current
and past data trends. It uses data modeling, data mining, machine learning, and deep
learning algorithms to extract the required information from data and project behavioral
patterns into the future.
Some industry tools used for predictive analytics are Periscope Data, Google AI Platform,
SAP Predictive Analytics, Anaconda, Microsoft Azure, Rapid Insight Veera, and KNIME
Analytics Platform.
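As a small illustration of projecting a past trend into the future, the sketch below fits a straight line to historical values by ordinary least squares and uses it to forecast the next period. Linear regression is only one of many possible models and is chosen here as an assumption for simplicity. The sales figures and function names are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up historical data: monthly sales for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100.0, 120.0, 140.0, 160.0, 180.0]

model = fit_line(months, sales)
forecast = predict(model, 6)  # projected sales for month 6
print(model, forecast)
```

Real predictive analytics tools follow the same pattern of fitting a model to past data and applying it to unseen inputs, but with richer models and validation steps.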
Mobile Business Intelligence and Big Data:
Mobile Business Intelligence (BI) is the ability to access and perform BI-related data analysis
on mobile devices and tablets. It can help users make data-driven decisions wherever they are.
Mobile BI is different from big data, which is a term for large and complex data sets that
require advanced tools and techniques to process and analyze.
Since the introduction of business intelligence software, managers and executives have
typically accessed the information they need on traditional desktop and laptop computers.
As the use of mobile computing devices has increased, including Internet-capable mobile
phones, business intelligence applications have been developed for these devices. Mobile
business intelligence applications give users access to the software that stores the
information they need.
Need for mobile BI:
Mobile phones' data storage capacity has grown with their use. In today's fast-paced
environment, you are expected to make decisions and act quickly, and the number of
businesses turning to mobile BI for help in this situation is growing by the day.
Mobile BI can help you expand your business or boost its productivity, and it works for
both small and large businesses. It can help you whether you are a salesperson or a CEO.
Demand for mobile BI is high because it reduces the time spent obtaining information and
frees that time for quick decision making.
Advantages of mobile BI
1. Simple access
Mobile BI is not restricted to a single mobile device or a certain place. You can view
your data at any time and from any location. Having real-time visibility into a firm
improves productivity and the daily efficiency of the business. Getting an overview of
the company with a single click simplifies the process.
2. Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to
stay ahead of the competition. Easy access to real-time data improves business
opportunities and can raise sales and capital. It also helps in making the necessary
decisions as market conditions change.
3. Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from
any location. Mobile BI delivers information on demand, which helps users get what
they need when they need it. As a result, decisions are made quickly.
4. Increased productivity
By extending BI to mobile, the organization's teams can access critical company data
when they need it. Obtaining all of the corporate data with a single click frees up a
significant amount of time to focus on the smooth and efficient operation of the firm.
Increased productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1. Stack of data
The primary function of mobile BI is to store data in a systematic manner and then
present it to the user as required. As a result, mobile BI stores all of the information
and ends up with heaps of older data. The company needs only a small portion of that
older data, but it must store all of it, which ends up in the stack.
2. Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for
these expensive services, but small businesses cannot. The cost of mobile BI itself is
not the only expense: we must also consider the cost of IT staff for the smooth
operation of BI, as well as the hardware costs involved.