CCS 334
Big data analytics describes the process of uncovering trends, patterns, and correlations in large
amounts of raw data to help make data-informed decisions. These processes use familiar statistical
analysis techniques—like clustering and regression—and apply them to more extensive datasets with
the help of newer tools.
1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud storage to
mobile applications to in-store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it easily. Raw or unstructured
data that is too diverse or complex for a warehouse may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing exponentially,
making data processing a challenge for organizations. One processing option is batch processing,
which looks at large data blocks over time. Batch processing is useful when there is a longer turnaround
time between collecting and analyzing data. Stream processing looks at small batches of data at once,
shortening the delay time between collection and analysis for quicker decision-making. Stream
processing is more complex and often more expensive.
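To make the contrast concrete, here is a minimal pure-Python sketch of the two options (the event data and function names are illustrative, not from any particular framework):

    # Hypothetical collected events.
    events = [{"user": "a", "amount": 10}, {"user": "b", "amount": 25},
              {"user": "a", "amount": 5}]

    # Batch processing: analyze a large block of data in one pass,
    # typically on a schedule (hourly, nightly, etc.).
    def process_batch(batch):
        total = sum(e["amount"] for e in batch)
        print(f"batch of {len(batch)} events, total amount = {total}")

    process_batch(events)

    # Stream processing: handle each event as it arrives, keeping a running
    # result so decisions can be made with a shorter delay.
    running_total = 0
    for event in events:  # stand-in for an unbounded stream of events
        running_total += event["amount"]
        print(f"stream update: running total = {running_total}")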
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for.
Dirty data can obscure and mislead, creating flawed insights.
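As a small illustration of this scrubbing, here is a hedged pandas sketch covering deduplication, dropping records that are missing key fields, and formatting dates consistently (the column names and values are hypothetical; assumes the pandas library is installed):

    import pandas as pd

    # Hypothetical raw records containing a duplicate and a missing name.
    raw = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bob", None],
        "signup_date": ["2024-01-05", "2024-01-05", "2024-01-07", "2024-02-10"],
        "spend": [100.0, 100.0, 55.5, 20.0],
    })

    clean = (raw
             .drop_duplicates()              # eliminate duplicative rows
             .dropna(subset=["customer"]))   # drop rows missing a key field
    # Format the date column consistently for reliable analytical queries.
    clean["signup_date"] = pd.to_datetime(clean["signup_date"])

    print(clean)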
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can
turn big data into big insights. Some of these big data analysis methods include:
Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the future,
identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
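As a concrete taste of the clustering used in data mining, here is a minimal scikit-learn k-means sketch on toy data (assumes scikit-learn and NumPy are installed; the two-cluster choice and the points are made up):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy dataset: two obvious groups of points (e.g., low and high spenders).
    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
                  [8.0, 8.5], [8.3, 9.0], [7.9, 8.2]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # center of each discovered cluster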
3. Big data analytics tools and technology
Big data analytics cannot be narrowed down to a single tool or technology. Instead, several types of
tools work together to help you collect, process, cleanse, and analyze big data. Some of the major
players in big data ecosystems are listed below.
Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed
schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not
only SQL,” and these databases can handle a variety of data models.
MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters data to various nodes within the cluster. The second is reducing,
which organizes and reduces the results from each node to answer a query.
YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and resource
management in the cluster.
Spark is an open-source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation (a minimal PySpark sketch follows this list).
Tableau is an end-to-end data analytics platform that allows you to prep, analyze, collaborate,
and share your big data insights. Tableau excels in self-service visual analysis, allowing people
to ask new questions of governed big data and easily share those insights across the
organization.
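As referenced in the Spark entry above, here is a minimal PySpark word-count sketch (assumes the pyspark package is installed and a local Spark runtime is available; the input file name logs.txt is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    lines = spark.read.text("logs.txt")  # one row per line of text
    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())  # split lines into words
              .map(lambda word: (word, 1))             # pair each word with 1
              .reduceByKey(lambda a, b: a + b))        # sum counts per word

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()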
Examples of unstructured data:
Social media: Social media has a component of semi-structured data (i.e., data that does not
conform to a data model but has some structure), but the content of each social media message
itself is unstructured.
Email: While we sometimes consider this semi-structured, email message fields are text fields
that are not easily analyzed. Email content may include video, audio, or photo content as well,
making them unstructured.
Text files: Almost all traditional business files — including word processing documents (e.g.,
Google Docs or Microsoft Word), presentations (e.g., Microsoft PowerPoint), notes, and PDFs
— are classified as unstructured data.
Survey responses: When open-ended feedback is gathered via survey (e.g., text box) or through
respondents selecting "liked" photos, unstructured data is being gathered.
Scientific data: Scientific data can include field surveys, space exploration, seismic imagery,
atmospheric data, topographic data, weather data, and medical data. While these types of data
may have a base structure for collection, the data itself is often unstructured and may not lend
itself to traditional analysis tools and dashboards.
Machine and sensor data: Billions of small files from IoT (Internet of Things) devices, such as
mobile phones and iPads, generate significant amounts of unstructured data. In addition,
business systems’ log files, which are not consistent in structure, also create vast amounts of
unstructured data.
Unstructured data has an internal structure but does not contain a predetermined data model or
schema. It can be textual or non-textual, and it can be human-generated or machine-generated. One of
the most common types of unstructured data is text. "Unstructured" simply means that the datasets
(typically large collections of files) are not stored in a structured database format. For instance, a
photo can be a TIFF, JPEG, GIF, PNG, or RAW file, each with its own characteristics.
Big data comes from a variety of sources, including internal systems like a company's
transaction processing system, customer databases, medical records, internet browsing history, social
networks and digital documents like emails.
Big data is a term used to describe the large amounts of data flowing into an enterprise day
after day. However, it isn't the volume of data that is important; it's what organisations do with the data
that truly gives value to a business. Big data can be analysed for insights that lead to better
decisions and strategic business moves, and it is used across nearly all industries. Below are six
examples of big data being used across some of the main industries in the UK.
1.) Retail Good customer service and building customer relationships are vital in the retail industry. One
of the best ways to build and maintain this service and these relationships is through big data analysis. Retail
companies need to understand the best techniques to market their products to their customers, the best
process to manage transactions and the most efficient and strategic way to bring back lapsed customers
in such a competitive industry.
2.) Banking Due to the amount of data streaming into banks from a wide variety of channels, the
banking sector needs new means to manage big data. Of course like the retail industry and all others it is
important to build relationships, but banks must also minimise fraud and risk whilst at the same time
maintaining compliance.
3.) Manufacturing Manufacturers can use big data to boost their productivity whilst also minimising
wastage and costs - processes which are welcomed in all sectors but vital within manufacturing. There
has been a large cultural shift by many manufacturers to embrace analytics in order to make more
speedy and agile business decisions.
4.) Education Schools and colleges which use big data analysis can make large positive differences to
the education system, its employees and students. By analysing big data, schools are supplied with the
intel needed to implement a better system for evaluating and supporting teachers, to make sure students
are progressing, and to identify at-risk pupils.
5.) Government The Government has large scope to make changes to the community we live in as a
whole by utilising big data, such as dealing with traffic congestion, preventing crime, running
agencies and managing utilities. Governments however need to address the issues of privacy and
transparency.
6.) Health Care Health Care is one industry where lives could be at stake if information isn’t quick,
accurate and in some cases, transparent enough to satisfy strict industry regulations. When big data is
analysed effectively, health care providers can uncover insights that can find new cures and improve the
lives of everyone.
Web analytics
Web analytics helps you understand more about user demographics, goals, and behavior, so you can tailor your
website's content and product offerings to their specific needs.
The objective of web analytics is to serve as a business metric for promoting specific products to the
customers who are most likely to buy them and to determine which products a specific customer is most
likely to purchase. This can help improve the ratio of revenue to marketing costs.
Web analytics refers to the process of collecting website data and then processing, reporting,
and analyzing it to create an online strategy for improving the website experience.
Web analytics tools are software designed to track, measure, and report on website activity
including site traffic, visitor source, and user clicks. Using web analytics tools helps you understand
what's happening on your website and get insights on what's working (and what's not).
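As a tiny illustration of what such tools compute, this hedged Python sketch tallies page views and visitor sources from hypothetical, already-parsed access-log records (real tools ingest raw server logs or tracking events; the field names are made up):

    from collections import Counter

    # Hypothetical access-log records.
    hits = [
        {"page": "/home", "referrer": "google.com"},
        {"page": "/pricing", "referrer": "google.com"},
        {"page": "/home", "referrer": "twitter.com"},
        {"page": "/home", "referrer": "direct"},
    ]

    page_views = Counter(h["page"] for h in hits)           # site traffic
    visitor_sources = Counter(h["referrer"] for h in hits)  # visitor source

    print(page_views.most_common())      # e.g., [('/home', 3), ('/pricing', 1)]
    print(visitor_sources.most_common())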
1. Big Data in Education Industry : The education industry is flooded with massive amounts of data,
generally related to students, faculty, courses, results, and whatnot. Proper insights into these data can
be used to improve the operational effectiveness and working of education institutes.
E-learning has brought about such a revolution in the education system; it is a systematic and
properly managed platform for education.
2. Big Data in Government Sector : The government sector generally has to deal with massive amounts
of data on an almost daily basis, keeping track of various records and databases of cities, states, growth,
energy resources, geographical surveys, etc. Big data helps in the proper analysis and study of this data,
and it helps the government in endless ways.
For example:
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal
Government of the USA, uses Big data to analyze and discover patterns and associations. Using this
analysis, they identify and examine the expected or unexpected occurrences of food-based infections.
3. Big Data in Healthcare
a. It helps to detect diseases by identifying them in their early stages. This prevents the disease from
getting any worse and, in turn, makes treatment easy and effective.
b. It helps to predict the outbreak of an epidemic and detect the preventive measures that need to be
taken to minimize the effects of the same.
c. It helps in research on past medical results so that patients can be provided with better services and
medicines.
For Example: Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The aim is to
empower the iPhone users to store and access their real-time health records in their cellphones.
4. Big Data in Weather Patterns : All the data collected by weather sensors and satellites contributes to Big data and can be used in different ways, such as:
a. To study global warming.
b. In weather forecasting.
c. For preparations in the case of crisis.
d. To understand the patterns of natural disasters etc.
For Example: IBM Deep Thunder provides weather forecasting through high-performance computing
of Big data.
5. Big Data in Banking Sector
According to GDC prognosis, data in the Banking Sector is estimated to grow by 700 percent
by the end of the next year. Proper analysis and study of this data helps detect any and all illegal
activities being carried out, such as:
a. Venture credit hazard treatment.
b. Misuse of credit/debit cards.
c. Customer statistics alteration.
d. Risk mitigation.
e. Money laundering.
For Example:
Various software such as SAS AML (anti-money laundering) use Data Analytics to detect
suspicious transactions and analyze customer data.
The benefits extracted from big data in the transportation industry are:
a. Real-time estimation of congestion and traffic control patterns. For example – we are generally using
Google Maps to locate the least traffic-prone routes.
b. Big data is used to understand and estimate the user’s needs on different routes and then utilize route
planning to reduce wait time.
c. Predictive analysis and Real-time processing of Big data help to identify the accident-prone areas. It
helps to reduce accidents and increase the safety level of traffic.
For Example:
Uber generally generates and uses a vast amount of data regarding drivers, vehicles, locations,
etc. All this data is analyzed and used to predict supply, location, demand, fares that need to be set for
every trip, etc.
Big data also benefits the food and restaurant industry, for example:
a. Predicting the number of customers at a specific time of day so that food can be delivered
according to demand.
b. Collecting information about customer’s interests that helps them to design marketing
strategies and follow trends.
c. When there are more customers at the food counter, Big data applications show only those
items on the menu card or screen, which can be prepared within a short time.
d. Evaluating the performance of a branch and then fixing the locations where the new outlets
or offices should be opened in order to increase the profit.
Big Data in Telecommunications: Big data in telecommunications helps to visualize the transferred
data, which supports better management and customer satisfaction.
a. Visualizing techniques defined by algorithms can detect fraud groups who illegally
access the system through fake profiles.
b. Big data helps telecommunication companies to differentiate target audiences and add policies
according to customer segmentation.
c. Predictive analytics helps to gather customer feedback, which helps to analyze customer
preferences and interests.
d. Analysis of current network management and the customer engagement rate becomes easy;
this information also helps in identifying areas for improvement.
12. Big Data in the Airline Industry
Big data has its best utilization in the airline industry, as it provides minute-to-minute
operational data. It helps airlines to gather information about customer service, weather forecasts,
ticketing, etc. It also helps in making decisions that improve customer satisfaction and meet demand.
Let us look at some best uses of Big data for the airline industry:
a. Smart maintenance of aircraft by comparing operating costs, fuel quantity and costs, etc.
b. Improving the safety and security of flights by capturing flight-related data and incident data,
which also strengthens links in the aviation chain.
c. Enhancing customer service and understanding customers' buying habits by analyzing past
information. Example: it helps customers check the real-time status of their baggage.
d. Analyzing air traffic control and in-flight telemetry data to provide a comfortable
flight.
13. Big Data in Disaster Management
Almost every year, natural calamities like floods, hurricanes, and earthquakes cause colossal
damage to land, animals, humans, etc., and many times scientists are not able to predict the possibility
of a disaster. But the introduction of Big data in disaster management provides a new dimension for the
betterment of things.
Recent developments in artificial intelligence (AI), data mining, and visualization are
helping meteorologists to forecast weather conditions more accurately. Some of the uses of Big data in
disaster management are:
a. It helps the government to take necessary actions to reduce the adverse effects of natural disasters.
b. Analyzing data collected from satellites and radar helps in examining the weather conditions
every 12 hours.
c. Identifying the possibility of disasters by evaluating temperature, water level, wind pressure, and
other related factors.
d. It involves clustering techniques, visualization, streamflow simulation, and association rules that can
generate highly accurate results.
14. Big Data in Security
a. The government collects all the information about the citizens of its nation, and this data is
further maintained in a proper database.
b. It helps in evaluating the population's density in a specific location and identifying
possible threatening situations even before anything occurs.
c. Security officers use the maintained database to find any criminal or to detect any fraudulent
activities in any area of the country.
d. It also helps to predict the potential outbreak of any virus or disease and to take the
necessary actions to prevent it.
15. Big Data in Ecommerce
E-Commerce is one of the best ways through which people can earn online. It enjoys
the benefits of operating online but also faces many challenges in achieving its business objectives.
Big data in e-commerce provides a competitive advantage by delivering insights and analytical
reports. Let’s look at some more benefits of Big data in e-commerce:
a. Collecting data on customer requirements even before official operations have started.
b. Creating a high-performance marketing model and setting up a startup according to current
requirements and trends.
c. Identification of the most viewed products and the pages that appeared the maximum number
of times can be easily done. This results in further enhancement of the e-commerce business.
d. Most importantly, it helps to evaluate the customer’s behavior and suggests the products
according to their interests. It helps to increase the number of sales and generate revenue.
e. Big data applications can create a report depending on the visitor’s age, gender, location, etc.
Top Big Data Technologies
We can categorize the leading big data technologies into the following four sections:
o Data Storage
Hadoop - When it comes to handling big data, Hadoop is one of the leading
technologies that come into play. This technology is based entirely on the MapReduce architecture
and is mainly used to process data in batches. The Hadoop framework was mainly introduced to
store and process data in a distributed data processing environment across commodity hardware
with a simple programming model.
Cassandra - Cassandra is one of the leading big data technologies and is among the top NoSQL
databases.
RainStor - RainStor is a popular database management system designed to manage and analyze
organizations' Big Data requirements. It uses deduplication strategies that help manage the storing
and handling of vast amounts of data for reference.
Hunk - Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. It lets us use the Splunk Search Processing Language (SPL) to analyze
data. Also, Hunk allows us to report on and visualize vast amounts of data from Hadoop and
NoSQL data sources.
o Data Mining
Presto is an open-source distributed SQL query engine used for running interactive
analytical queries against data sources of all sizes.
RapidMiner is a data science platform that offers a very robust and
powerful graphical user interface to create, deliver, manage, and maintain predictive analytics.
o Data Analytics
Apache Kafka is a popular streaming platform, primarily known for its three core
capabilities: publishing and subscribing to streams of records, storing those streams durably,
and processing them as they arrive. It is referred to as a distributed streaming platform and acts
as an asynchronous messaging broker system that can ingest and process real-time streaming
data. This platform is similar to an enterprise messaging system or messaging queue (a minimal
producer/consumer sketch follows this group of tools).
Splunk is known as one of the popular software platforms for capturing, correlating, and
indexing real-time streaming data in searchable repositories. Splunk can also produce graphs,
alerts, summarized reports, data visualizations, dashboards, etc.
Apache Spark is one of the core technologies in the list of big data technologies. It is
one of those essential technologies which are widely used by top companies.
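As referenced in the Kafka entry above, here is a minimal producer/consumer sketch using the third-party kafka-python client (assumptions: kafka-python is installed, a broker is running on localhost:9092, and the topic name "events" is hypothetical):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish a record to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"user signed up")
    producer.flush()  # make sure the record is actually sent

    # Consumer: subscribe to the same topic and read records as they arrive.
    consumer = KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)  # stop after 5 s idle
    for record in consumer:
        print(record.value)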
o Data Visualization
Tableau is one of the fastest and most powerful data visualization tools used by leading
business intelligence industries. It helps in analyzing data at a very fast speed. Tableau
helps in creating visualizations and insights in the form of dashboards and worksheets.
Plotly is best suited for plotting or creating graphs and relevant components at a faster speed in
an efficient way.
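For example, a minimal Plotly sketch that renders an interactive bar chart (assumes the plotly package is installed; the sales figures are made up):

    import plotly.express as px

    # Hypothetical monthly sales figures.
    fig = px.bar(x=["Jan", "Feb", "Mar"], y=[120, 150, 90],
                 labels={"x": "Month", "y": "Sales"},
                 title="Monthly sales (illustrative data)")
    fig.show()  # opens an interactive chart in the browser or notebook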
8. What is Hadoop : Hadoop is an open source framework from Apache and is used to store, process,
and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online
analytical processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo,
Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files will be broken into blocks and stored in nodes
over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, which is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel computation
on data using key-value pairs. The Map task takes input data and converts it into a dataset which
can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task,
and the output of the reducer gives the desired result (a minimal sketch follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
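As referenced in the Map Reduce item above, here is a minimal pure-Python simulation of the map and reduce phases using key-value pairs (word count); a real Hadoop job would distribute these phases across the cluster's nodes:

    from collections import defaultdict

    def map_phase(lines):
        # Map task: convert input data into (key, value) pairs.
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def reduce_phase(pairs):
        # Reduce task: aggregate the values for each key.
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)

    data = ["big data big insights", "big data tools"]
    print(reduce_phase(map_phase(data)))  # {'big': 3, 'data': 2, ...}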
Hadoop Architecture:
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes the JobTracker and NameNode, whereas each slave node includes a TaskTracker and
DataNode.
Hadoop Distributed File System: The Hadoop Distributed File System (HDFS) is a distributed file
system for Hadoop. It contains a master/slave architecture, consisting of a single NameNode, which
performs the role of master, and multiple DataNodes, which perform the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily run the
NameNode and DataNode software.
NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations like opening, renaming and
closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by
using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can
also be called a mapper.
MapReduce Layer: The MapReduce comes into existence when the client application submits the
MapReduce job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task
Trackers. Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is
rescheduled.
Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really
cost-effective compared to traditional relational database management systems.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node
is down or some other network failure happens, Hadoop takes the other copy of the data and
uses it. Normally, data is replicated three times, but the replication factor is configurable.
History of Hadoop: Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin
was the Google File System (GFS) paper published by Google.
Ubuntu: Launched in 2004, Ubuntu is the most popular Linux distribution in today's world,
especially on the desktop side. Ubuntu is an open source software platform which runs almost
everywhere, from IoT devices, smartphones, tablets and PCs to the server and the cloud.
MySQL
MySQL is the most widely used and highly popular open source relational SQL database
management system in the world, used for database servers. It powers popular websites and
services including Wikipedia and Facebook, and it is the M in the LAMP stack (Linux, Apache,
MySQL and PHP). MySQL is one of the best RDBMSs for developing web-based software
applications.
Apache: The Apache HTTP Server has been the most popular web server software since 1996,
when it got started. Apache still holds a strong lead, outclassing the runner-up IIS in terms of the
number of deployed websites (according to Netcraft, Apache is currently used by 46% of all
websites, while IIS is used by 29%). In 2009 it became the first web server to be used by more
than 100 million websites.
Java: Java is a high-level programming language originally developed by Sun Microsystems that
has powered enterprise applications since 1995. Java runs on more than 800 million PCs, more
than two billion handheld devices, and 3.5 billion smart cards, and has been implemented in a host
of set-top boxes, web cams, games, medical devices and much more. Major companies such as
Oracle (Java's new owner) and IBM consider Java a technology to watch and embrace for all
levels of enterprise use.
Why going for Open Source technology?
Seeing the sudden boom in open source technologies, you must be wondering why companies are
increasingly adopting open source technologies instead of proprietary ones. Here are the reasons behind this:
Cutting down the cost- Though companies are investing heavily in acquiring the latest and
most advanced technologies, saving costs has always remained a concern. Open source
technologies relieve firms from spending money for this purpose: they are available free of cost
and ready to be deployed, which saves a significant amount of money.
Quality improvement- Developers can customize or modify the source code as and when
required, which is not possible with proprietary software. Whether it is bug removal or a
modification, changing the code improves the quality of the technology.
Agile business- With the increasing speed of data, it has become imperative for companies
to be instantly responsive. Adopting open-source technologies is one way of achieving that:
such software speeds up the development process, as it is quicker to debug existing code
than to write new code.
Mitigate business risk- Technology is changing rapidly, making it hard for companies to
rely on a single vendor or multiple vendors of software. Using an open source software stack is the
solution to this problem, as its transparency allows companies to build a vibrant community,
which in turn lessens business risk.
Types of crowdsourcing include:
Wisdom - Wisdom of crowds is the idea that large groups of people are collectively smarter
than individual experts when it comes to problem-solving or estimating values (like the
weight of a cow or the number of jelly beans in a jar); a small simulation sketch follows this list.
Creation - Crowd creation is a collaborative effort to design or build something. Wikipedia
and other wikis are examples of this. Open-source software is another good example.
Voting - Crowd voting uses the democratic principle to choose a particular policy or course of
action by "polling the audience."
Funding - Crowdfunding involves raising money for various purposes by soliciting relatively
small amounts from a large number of funders.
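As referenced under Wisdom above, a small simulation shows why the crowd's average beats a typical individual guess, under the stated assumption that guesses are noisy but unbiased (all numbers are made up):

    import random

    random.seed(42)
    true_weight = 600  # hypothetical cow weight in kg
    # 1,000 individual guesses, each noisy but centered on the true value.
    guesses = [random.gauss(true_weight, 80) for _ in range(1000)]

    crowd_estimate = sum(guesses) / len(guesses)
    typical_error = sum(abs(g - true_weight) for g in guesses) / len(guesses)

    print(f"crowd estimate: {crowd_estimate:.1f} kg "
          f"(off by {abs(crowd_estimate - true_weight):.1f} kg)")
    print(f"typical individual error: {typical_error:.1f} kg")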
16. What do you mean by inter- and trans-firewall analytics?
Inter-firewall analytics
Focus: Analyzes traffic flows between different firewalls within a network.
Methodology: Utilizes data collected from multiple firewalls to identify anomalies and potential
breaches.
Benefits: Provides a comprehensive view of network traffic flow and helps identify lateral
movement across different security zones.
Limitations: Requires deployment of multiple firewalls within the network and efficient data
exchange mechanisms between them.
Trans-firewall analytics
Focus: Analyzes encrypted traffic that traverses firewalls, which traditional security solutions may
not be able to decrypt and inspect.
Methodology: Uses deep packet inspection (DPI) and other advanced techniques to analyze the
content of encrypted traffic without compromising its security.
Benefits: Provides insight into previously hidden threats within encrypted traffic and helps detect
sophisticated attacks.
Limitations: Requires specialized hardware and software solutions for DPI, and raises concerns
regarding potential data privacy violations.
Difference between inter- and trans-firewall analytics:
Methodology: inter-firewall analytics analyzes data from multiple firewalls, whereas trans-firewall
analytics uses DPI and other techniques to analyze encrypted traffic.
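To make the inter-firewall idea concrete, here is a hedged pure-Python sketch that merges connection logs from two hypothetical firewalls and flags source addresses seen in more than one security zone, a toy stand-in for detecting lateral movement (all field names, addresses, and the rule itself are illustrative):

    from collections import defaultdict

    # Hypothetical connection logs collected from two different firewalls.
    fw_dmz = [{"src": "10.0.0.5", "dst": "10.1.0.2"},
              {"src": "10.0.0.9", "dst": "10.1.0.3"}]
    fw_internal = [{"src": "10.0.0.5", "dst": "10.2.0.7"},
                   {"src": "10.2.0.4", "dst": "10.2.0.9"}]

    # Merge the logs and record which zones each source address appears in.
    zones_seen = defaultdict(set)
    for zone, log in [("dmz", fw_dmz), ("internal", fw_internal)]:
        for conn in log:
            zones_seen[conn["src"]].add(zone)

    # Flag sources crossing more than one zone as possible lateral movement.
    for src, zones in zones_seen.items():
        if len(zones) > 1:
            print(f"possible lateral movement: {src} seen in {sorted(zones)}")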