CCS 334 | Apache Hadoop | Big Data

Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. It involves collecting data from various sources, processing and cleaning the data, then analyzing it using techniques like data mining, predictive analytics, and deep learning. Tools like Hadoop, NoSQL databases, MapReduce, and Spark help store and analyze big data. Benefits of big data analytics include cost savings, better understanding of customer needs to improve product development, and gaining market insights from analyzing purchase behavior and trends. Unstructured data makes up a significant portion of enterprise data and includes things like social media, emails, documents, and machine/sensor data. Big data is used across industries such as retail, banking, healthcare, and education.

1. What is big data analytics?

Big data analytics describes the process of uncovering trends, patterns, and correlations in large
amounts of raw data to help make data-informed decisions. These processes use familiar statistical
analysis techniques—like clustering and regression—and apply them to more extensive datasets with
the help of newer tools.

2. How big data analytics works


Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help
organizations operationalize their big data.

1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud storage to
mobile applications to in-store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it easily. Raw or unstructured
data that is too diverse or complex for a warehouse may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing exponentially,
making data processing a challenge for organizations. One processing option is batch processing,
which looks at large data blocks over time. Batch processing is useful when there is a longer turnaround
time between collecting and analyzing data. Stream processing looks at small batches of data at once,
shortening the delay time between collection and analysis for quicker decision-making. Stream
processing is more complex and often more expensive.
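The batch-versus-stream distinction can be sketched in a few lines of Python (a hypothetical illustration of the two processing styles, not a real processing framework):

```python
# Batch processing: accumulate records, then analyze the whole block at once.
def batch_average(records):
    return sum(records) / len(records)

# Stream processing: update the result as each record arrives,
# so an answer is available with minimal delay after every event.
def stream_averages(records):
    total, count, results = 0, 0, []
    for value in records:
        total += value
        count += 1
        results.append(total / count)  # running average after each event
    return results

readings = [10, 20, 30, 40]
print(batch_average(readings))        # one answer, after all data has arrived
print(stream_averages(readings)[-1])  # same final value, but updated per event
```

Both arrive at the same final answer; the stream version simply produces an intermediate answer after every record, which is what shortens the delay between collection and analysis.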

3. Clean Data
Data, big or small, requires scrubbing to improve data quality and yield stronger results: all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for.
Dirty data can obscure and mislead, creating flawed insights.
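The scrubbing steps above (fixing formatting, dropping duplicates and irrelevant rows) can be sketched as a small Python function; the field names and rules here are hypothetical:

```python
def clean_records(records):
    """Normalize formatting and drop duplicate or incomplete rows (illustrative only)."""
    seen, cleaned = set(), []
    for rec in records:
        name = rec.get("name", "").strip().title()   # fix inconsistent formatting
        email = rec.get("email", "").strip().lower()
        if not email:          # drop rows missing a key field (irrelevant data)
            continue
        if email in seen:      # drop duplicates of a record we already kept
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": " alice smith ", "email": "Alice@example.com"},
    {"name": "Alice Smith", "email": "alice@example.com "},  # duplicate
    {"name": "Bob", "email": ""},                            # incomplete row
]
print(clean_records(raw))  # a single, consistently formatted Alice record
```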

4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can
turn big data into big insights. Some of these big data analysis methods include:
 Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
 Predictive analytics uses an organization’s historical data to make predictions about the future,
identifying upcoming risks and opportunities.
 Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
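As a toy illustration of the anomaly-identification idea behind data mining, a simple z-score rule flags values that fall far from the mean; this is a sketch of the concept, not a production technique:

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

daily_orders = [100, 98, 102, 101, 99, 500]  # 500 is the unusual day
print(find_anomalies(daily_orders))
```

Real data-mining tools apply far more sophisticated clustering and pattern-detection methods, but the principle is the same: separate the values that fit the pattern from the ones that do not.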
3. Big data analytics tools and technology
Big data analytics cannot be narrowed down to a single tool or technology. Instead, several types of
tools work together to help you collect, process, cleanse, and analyze big data. Some of the major
players in big data ecosystems are listed below.

 Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
 NoSQL databases are non-relational data management systems that do not require a fixed
schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not
only SQL,” and these databases can handle a variety of data models.
 MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters data and distributes it to various nodes within the cluster. The second is reducing,
which organizes and reduces the results from each node to answer a query.
 YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and resource
management in the cluster.
 Spark is an open source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
 Tableau is an end-to-end data analytics platform that allows you to prep, analyze, collaborate,
and share your big data insights. Tableau excels in self-service visual analysis, allowing people
to ask new questions of governed big data and easily share those insights across the
organization.
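The map-and-reduce pattern described above can be sketched in plain Python; this is a hypothetical word count with no real Hadoop cluster involved:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big insights", "big data tools"]
print(reduce_phase(map_phase(docs)))
# In Hadoop, the map tasks run in parallel on many nodes and the framework
# shuffles the intermediate pairs to the reducers grouped by key.
```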

The big benefits of big data analytics


The ability to analyze more data at a faster rate can provide big benefits to an organization,
allowing it to more efficiently use data to answer important questions. Big data analytics is important
because it lets organizations use colossal amounts of data in multiple formats from multiple sources to
identify opportunities and risks, helping organizations move quickly and improve their bottom lines.
Some benefits of big data analytics include:

 Cost savings. Helping organizations identify ways to do business more efficiently


 Product development. Providing a better understanding of customer needs
 Market insights. Tracking purchase behavior and market trends

4. What is the convergence of key trends in big data?


Data is the ultimate resource that fuels AI models to acquire new skills. Big data and AI
therefore share a synergistic relationship, wherein AI algorithms can extract unprecedented insights
from big datasets.

Unstructured data usage


Unstructured data, or nonrelational data, makes up a significant portion of the enterprise data that exists
today. Examples of unstructured data used every day include:

 Social media: Social media has a component of semi-structured data (e.g., data that does not
conform to a data model but has some structure) but the content of each social media message
itself is unstructured.

 Email: While we sometimes consider this semi-structured, email message fields are text fields
that are not easily analyzed. Email content may include video, audio, or photo content as well,
making them unstructured.

 Text files: Almost all traditional business files — including word processing documents (e.g.,
Google Docs or Microsoft Word), presentations (e.g., Microsoft PowerPoint), notes, and PDFs
— are classified as unstructured data.

 Survey responses: When open-ended feedback is gathered via survey (e.g., text box) or through
respondents selecting "liked" photos, unstructured data is being gathered.

 Scientific data: Scientific data can include field surveys, space exploration, seismic imagery,
atmospheric data, topographic data, weather data, and medical data. While these types of data
may have a base structure for collection, the data itself is often unstructured and may not lend
itself to traditional analysis tools and dashboards.

 Machine and sensor data: Billions of small files from IoT (Internet of Things) devices, such as
mobile phones and iPads, generate significant amounts of unstructured data. In addition,
business systems’ log files, which are not consistent in structure, also create vast amounts of
unstructured data.
Unstructured data has an internal structure but does not contain a predetermined data model or
schema. It can be textual or non-textual. It can be human-generated or machine-generated. One of the
most common types of unstructured data is text.

File formats also vary widely: a photo, for instance, can be TIFF, JPEG, GIF, PNG, or RAW, each
format with its own characteristics.

INDUSTRY EXAMPLES OF BIG DATA

Big data comes from a variety of sources, including internal systems like a company's
transaction processing system, customer databases, medical records, internet browsing history, social
networks and digital documents like emails.

Big data is a term used to describe the large amounts of data flowing into an enterprise day after
day. However, it isn’t the volume of data that is important; it’s what organisations do with the data
that truly gives value to a business. Big data can be analysed for insights that lead to better
decisions and strategic business moves, and it is used across nearly all industries. Below are six
examples of big data being used across some of the main industries in the UK.

1.) Retail Good customer service and building customer relationships are vital in the retail industry. One of
the best ways to build and maintain this service and these relationships is through big data analysis. Retail
companies need to understand the best techniques to market their products to their customers, the best
process to manage transactions and the most efficient and strategic way to bring back lapsed customers
in such a competitive industry.

2.) Banking Due to the amount of data streaming into banks from a wide variety of channels, the
banking sector needs new means to manage big data. Of course like the retail industry and all others it is
important to build relationships, but banks must also minimise fraud and risk whilst at the same time
maintaining compliance.
3.) Manufacturing Manufacturers can use big data to boost their productivity whilst also minimising
wastage and costs - processes which are welcomed in all sectors but vital within manufacturing. There
has been a large cultural shift by many manufacturers to embrace analytics in order to make more
speedy and agile business decisions.

4.) Education Schools and colleges which use big data analysis can make large positive differences to
the education system, its employees and students. By analysing big data, schools gain the
intelligence needed to implement a better system for evaluating and supporting teachers, to make sure students
are progressing and to identify at-risk pupils.

5.) Government The Government has a large scope to change the community we live in as a
whole by utilising big data, for example in dealing with traffic congestion, preventing crime, running
agencies and managing utilities. Governments, however, need to address the issues of privacy and
transparency.

6.) Health Care Health Care is one industry where lives could be at stake if information isn’t quick,
accurate and in some cases, transparent enough to satisfy strict industry regulations. When big data is
analysed effectively, health care providers can uncover insights that can find new cures and improve the
lives of everyone.

5. How do industries use big data?


Big data has been used in the industry to provide customer insights for transparent and simpler
products, by analyzing and predicting customer behavior through data derived from social media, GPS-
enabled devices, and CCTV footage.

7. How big is the big data industry?


The global big data analytics market size was valued at USD 271.83 billion in 2022. The
market is projected to grow from USD 307.52 billion in 2023 to USD 745.15 billion by 2030, exhibiting
a CAGR of 13.5% during the forecast period.
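These figures are internally consistent: compounding USD 307.52 billion at 13.5% per year over the seven years from 2023 to 2030 roughly reproduces the 2030 projection.

```python
start, rate, years = 307.52, 0.135, 7  # USD billions, 2023 -> 2030
projected = start * (1 + rate) ** years
print(round(projected, 2))  # roughly 746; the stated 745.15 reflects rounding in the reported CAGR
```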

Web analytics

helps you understand more about user demographics, goals, and behavior, so you can tailor your
website's content and product offerings to their specific needs.

The objective of web analytics is to serve as a business metric for promoting specific products to the
customers who are most likely to buy them and to determine which products a specific customer is most
likely to purchase. This can help improve the ratio of revenue to marketing costs.

Web analytics refers to the process of collecting website data and then processing, reporting,
and analyzing it to create an online strategy for improving the website experience.

Web analytics tools are software designed to track, measure, and report on website activity
including site traffic, visitor source, and user clicks. Using web analytics tools helps you understand
what's happening on your website and get insights on what's working (and what's not).
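At its core, a web analytics tool aggregates raw hit records into metrics such as page views per URL and visits per traffic source. A minimal sketch, with a hypothetical hit log:

```python
from collections import Counter

hits = [  # hypothetical raw hit log: (page, referrer_source)
    ("/home", "search"), ("/pricing", "search"),
    ("/home", "social"), ("/home", "email"),
]

page_views = Counter(page for page, _ in hits)
traffic_sources = Counter(source for _, source in hits)

print(page_views.most_common(1))   # the most-viewed page and its view count
print(traffic_sources["search"])   # visits arriving from search
```

Real tools add sessionization, unique-visitor tracking, and goal funnels on top of exactly this kind of aggregation.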
1. Big Data in Education Industry : The education industry is flooded with massive amounts of data.
The data is generally related to students, faculty, courses, results, and more. Proper insights into
these data can be used to improve the operational effectiveness and working of educational institutes.

E-learning has brought such a revolution to the education system. E-learning is a systematic and
properly managed platform for education.

a. Course Material Reframing


It involves reframing course material according to data collected on what
students want to learn. It is also done on the basis of real-time monitoring of course components, which
shows how long the course remains beneficial for the students.
b. Customized and Dynamic Learning Programs
Customized learning programs can be developed for individual students. This can be
done by analyzing the data of each student’s learning history, which would contribute to improving
the overall results of students.
c. Better Grading Systems
New advancements in grading systems have been introduced, based on proper analysis of
student data.
d. Career Prediction
Every student’s record is appropriately maintained, analyzed, and studied. This
helps to understand each student’s progress, weaknesses, strengths, interests, and more. It would also help
to decide which career would be suitable for each student in the future.
2. Big Data in Government Sector

The government sector generally has to deal with massive amounts of data on an almost daily
basis. They usually keep track of various records and databases of the cities, states, growth, energy
resources, geographical surveys, etc. Big data helps in proper analysis and study of this data. It helps the
government in endless ways.

a. To identify areas that need immediate attention.


b. In making decisions regarding various political programs.
c. To overcome national challenges such as terrorism, unemployment, energy resource exploration, etc.
d. Big Data is used in catching tax evaders.

For example:
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal
Government of the USA, uses Big data to analyze and discover patterns and associations. Using this
analysis, they identify and examine the expected or unexpected occurrences of food-based infections.

3. Big Data in Healthcare Industry


Big data contributes a lot to the Healthcare Sector. Most notably, it reduces the cost of treatment,
since there is less chance of having to perform unnecessary diagnoses.
Some of the best uses of Big data in the Healthcare Sector are:

a. It helps to detect diseases in their early stages, preventing them from
getting any worse. This, in turn, makes treatment easy and effective.
b. It helps to predict the outbreak of an epidemic and detect the preventive measures that need to be
taken to minimize the effects of the same.
c. It helps in research on past medical results so that patients can be provided with better services and
medicines.
For Example: Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The aim is to
empower the iPhone users to store and access their real-time health records in their cellphones.

4. Big Data in Weather Patterns


Weather sensors and satellites are deployed all around the globe. And massive amounts of data
are collected from those sensors. These data are used to monitor the weather and environmental
conditions.

All the data collected contribute to Big data and can be used in different ways, such as:
a. To study global warming.
b. In weather forecasting.
c. For preparations in the case of crisis.
d. To understand the patterns of natural disasters etc.

For Example: IBM Deep Thunder provides weather forecasting through high-performance computing
of Big data.
5. Big Data in Banking Sector
According to GDC prognosis, data in the Banking Sector is estimated to grow by 700 percent
by the end of the next year. Proper analysis and study of this data help detect any and all illegal
activities that are being carried out, such as:
a. Venture credit hazard treatment.
b. Misuse of credit/debit cards.
c. Customer statistics alteration.
d. Risk mitigation.
e. Money laundering.
For Example:
Various software such as SAS AML (anti-money laundering) use Data Analytics to detect
suspicious transactions and analyze customer data.

6. Big Data in the Media and Entertainment Industry


Today, people have access to various digital gadgets, and these devices generate a
large amount of data. Thus, there is a need for Big data in the media and entertainment industry. Social
media also contributes lots of data. Big data helps to handle and maintain all this
data.

The benefits extracted from big data in the media and entertainment industry are:

a. On-demand scheduling of media streams in digital media distribution platforms.


b. Predicting audience interests.
c. Getting insights from customer reviews.
d. Targeting of the advertisements.
For Example:
Spotify is an on-demand music platform that uses Big Data Analytics to collect data from users.
And it suggests music to customers according to their individual interests.

7. Big Data in Transportation Industry


Big data contributes to making transportation easier and more efficient. Some of the transportation
areas that benefit from Big data are:

a. Real-time estimation of congestion and traffic control patterns. For example – we are generally using
Google Maps to locate the least traffic-prone routes.
b. Big data is used to understand and estimate the user’s needs on different routes and then utilize route
planning to reduce wait time.
c. Predictive analysis and Real-time processing of Big data help to identify the accident-prone areas. It
helps to reduce accidents and increase the safety level of traffic.
For Example:
Uber generally generates and uses a vast amount of data regarding drivers, vehicles, locations,
etc. All this data is analyzed and used to predict supply, location, demand, fares that need to be set for
every trip, etc.

8. Big Data in Agriculture Sector


Big data is playing an important role in enhancing the performance of the agriculture sector.
The main aim is to minimize the loss and increase the generation of necessary food grains. Big data has
helped a lot to introduce the digital and futuristic methods of the existing agricultural traditions.

Some of the best uses of Big data in agriculture are:


a. Automating the watering system helps farmers to concentrate on other more critical factors.
b. Big data can maintain and analyze the data from past years and can suggest the pesticides that work
best under certain conditions.
c. Big data smart technologies are collecting data directly from the fields; advanced algorithms are used
for predicting the soil condition, weather condition. Based on these analyses, agricultural activities are
carried out in fields.
9. Big Data in the Fast-Food Industry
Undoubtedly, fast food is the most popular food choice all around the world. There are many
popular fast-food companies like McDonald’s, Pizza Hut, KFC, etc. These companies are implementing
Big data to remain at the top. Some of the best uses of Big data in the fast-food industry are:

a. Predicting the number of customers at the specific time of the day so that they can deliver the
food according to demand.
b. Collecting information about customer’s interests that helps them to design marketing
strategies and follow trends.
c. When there are more customers at the food counter, Big data applications show only those
items on the menu card or screen, which can be prepared within a short time.
d. Evaluating the performance of a branch and then fixing the locations where the new outlets
or offices should be opened in order to increase the profit.

10. Big Data in the Retail Industry


Big data gives an opportunity to the retail sector by analysis of the competitive marketplace and
customer interest. It determines customer engagement and customer satisfaction by collecting
multifarious data.

Here is how Big data improves the performance and efficiency of the retail sector:

a. It helps to increase customer intimacy and engagement by collecting information on
customers’ spending patterns.
b. With predictive analysis, the industry can compare and maintain the supply-demand ratio.
c. The range of products, according to customer demand, can be determined. It also helps to set
new business strategies for improvement.
d. Transactional data, social media data, and weather forecasting help to assure the updated
condition of the current situation. And it helps to determine that the right product is available
at the right time.
11. Big Data in Telecommunication
With the increase in the amount of data passing through different channels of communication, it
becomes essential to properly collect and manage it. Proper handling of data will lead to profit
maximization, effective strategies for companies, etc.

Big data in telecommunication helps to visualize the transferred data, which supports better
management and customer satisfaction.

a. Visualizing techniques that are defined by algorithms can detect fraud groups who illegally
access the system through fake profiles.
b. Big data help telecommunication companies to differentiate target audiences and add policies
according to customer segmentation.
c. Predictive analytics helps to get customer feedback that will help to analyze customer
preferences and interests.
d. Analysis of the current network management and customer engagement rate becomes easy;
this information will also help to get to know the area of improvement.
12. Big Data in the Airline Industry
Big data has its best utilization in the airline industry as it provides them with minute-to-minute
operational data. It helps them to gather information about customer service, weather forecasts,
ticketing, etc. It also helps to take decisions for customer satisfaction and to meet demands.
Let us look at some best uses of Big data for the airline industry:

a. Smart maintenance of aircraft by comparing operating costs, fuel quantity, and costs, etc.
b. Improving the safety and security of flights by capturing flight-related data and incident data; it
also strengthens aviation chain links.
c. Enhancing customer service and customer’s buying habits by analyzing past information.
Example- helps to check the real-time baggage status of customers.
d. Determining air traffic control, in-flight telemetry data information to have a comfortable
flight.
13. Big Data in Disaster Management
Almost every year, natural calamities like floods, hurricanes, and earthquakes cause colossal
damage to land, animals, humans, etc. Many times, scientists are not able to predict the possibility
of a disaster. But the introduction of Big data in disaster management provides a new dimension for the
betterment of things.

The recent development of artificial intelligence (AI), data mining, and visualization are
helping meteorologists to forecast weather conditions more accurately. Some of the uses of Big data in
disaster management are:

a. It helps the government to take necessary actions to reduce the adverse effects of natural disasters.
b. Analyzing data collected from satellites and radar that helps in examining the weather conditions
every 12 hours.
c. Identifying the possibility of disasters by evaluating temperature, water level, wind pressure, and
other related factors.
d. It involves clustering techniques, visualization, streamflow simulation, and association rules that can
generate highly accurate results.

14. Big Data to Ensure National Security


Big data plays a vital role in ensuring national security. Many police forces use Big data to
improve their workflow and operations all around the world.
Almost all developed countries implemented Big data in their social and security activities a
long time ago, and developing countries have also started to receive the
benefits of using Big data.

Let us go through some of the Big data benefits:

a. The government collects all the information about citizens of its nation, and this data is
further maintained in a proper database.
b. It helps in evaluating the population density in a specific location and identifying
possible threatening situations even before anything occurs.
c. Security officers use the maintained database to find any criminal or to detect any fraudulent
activities in any area of the country.
d. It also helps to predict the potential outspread of any virus or diseases and to take the
necessary actions to prevent it.
15. Big Data in Ecommerce
E-Commerce is one of the best ways through which people can earn online. Ecommerce enjoys
the benefits of operating online but also faces many challenges to achieve business objectives.
Big data in e-commerce provides a competitive advantage by delivering insights and analytical
reports. Let’s look at some more benefits of Big data in e-commerce:

a. Collects the data and customer requirements even before the official operation has started.
b. Creating a high-performance marketing model and setting a startup according to current
requirements and trends.
c. Identification of the most viewed products and the pages that appeared the maximum number
of times can be easily done. This results in further enhancement of the e-commerce business.
d. Most importantly, it helps to evaluate the customer’s behavior and suggests the products
according to their interests. It helps to increase the number of sales and generate revenue.
e. Big data applications can create a report depending on the visitor’s age, gender, location, etc.
Top Big Data Technologies
We can categorize the leading big data technologies into the following four sections:
o Data Storage
Hadoop - When it comes to handling big data, Hadoop is one of the leading
technologies. This technology is based on the MapReduce architecture and is mainly
used to process information in batches. The Hadoop framework was introduced to store and process data in a
distributed environment across clusters of commodity hardware, using a simple
programming model.
Cassandra- Cassandra is one of the leading big data technologies among the list of top NoSQL
databases.

MongoDB - MongoDB is another important big data technology in terms of
storage. Relational and RDBMS properties do not apply to MongoDB because it is a
NoSQL database. Unlike traditional RDBMS databases, which use structured
query languages, MongoDB stores data as flexible, schema-less documents.

RainStor - RainStor is a popular database management system designed to manage and analyze
organizations' Big Data requirements. It uses deduplication strategies that help it store and
handle vast amounts of data for reference.

Hunk - Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. It lets us use the Splunk Search Processing Language (SPL) to analyze
data. Hunk also allows us to report on and visualize vast amounts of data from Hadoop and
NoSQL data sources.
o Data Mining
Presto is an open-source distributed SQL query engine that allows interactive analytic
queries to be run against data sources of all sizes.
RapidMiner is a data science platform that offers a very robust and
powerful graphical user interface to create, deliver, manage, and maintain predictive analytics.
o Data Analytics
Apache Kafka is a popular distributed streaming platform. It is primarily
known for three core capabilities: publishing and subscribing to streams of records, storing
those streams durably, and processing them as they occur. It can also act as an asynchronous
message broker that ingests and processes real-time streaming
data, much like an enterprise messaging system or message queue.
Splunk is known as one of the popular software platforms for capturing, correlating, and
indexing real-time streaming data in searchable repositories. Splunk can also produce graphs,
alerts, summarized reports, data visualizations, dashboards, etc.
Apache Spark is one of the core technologies in the list of big data technologies. It is
one of those essential technologies which are widely used by top companies.
o Data Visualization
Tableau is one of the fastest and most powerful data visualization tools used by leading
business intelligence industries. It helps analyze data at very high speed and
create visualizations and insights in the form of dashboards and worksheets.
Plotly is best suited for plotting or creating graphs and relevant components quickly and
efficiently.
8. What is Hadoop : Hadoop is an open-source framework from Apache used to store, process,
and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online
analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo,
Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of
that HDFS was developed. It states that the files will be broken into blocks and stored in nodes
over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, used for job scheduling and cluster management.
3. Map Reduce: This is a framework that helps Java programs perform parallel computation
on data using key-value pairs. The Map task takes input data and converts it into a dataset that
can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and
the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
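The key-value flow described above can be sketched in plain Python. This is a toy simulation of the two phases (word count, the classic example), not the actual Hadoop Java API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map task: emit a (word, 1) key-value pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce task: sum the counts for each key to produce the final result."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data node"]
result = reduce_phase(map_phase(lines))
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In real Hadoop, many Map tasks run in parallel on different blocks of the input, and a shuffle step groups the pairs by key before the Reduce tasks run.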

Hadoop Architecture:

The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes the JobTracker, TaskTracker, NameNode, and DataNode, whereas a slave node includes
a DataNode and TaskTracker.

Hadoop Distributed File System: The Hadoop Distributed File System (HDFS) is a distributed file
system for Hadoop. It uses a master/slave architecture, consisting of a single NameNode that
performs the role of master and multiple DataNodes that perform the role of slaves.

Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is
developed in Java, so any machine that supports Java can easily run the NameNode and
DataNode software.
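A minimal sketch of the block-splitting and replica-placement idea described above. This uses a toy round-robin placement with made-up node names; real HDFS uses rack-aware placement and a 128 MB default block size:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Break a file into fixed-size blocks, as HDFS does with client files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` DataNodes (toy round-robin placement)."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

# A 10-byte "file" with 4-byte blocks yields 3 blocks (4 + 4 + 2 bytes).
blocks = split_into_blocks(b"x" * 10, block_size=4)
plan = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(plan)  # {0: ['dn1', 'dn2', 'dn3'], 1: ['dn2', 'dn3', 'dn4'], 2: ['dn3', 'dn4', 'dn1']}
```

In HDFS the NameNode keeps this block-to-DataNode mapping as metadata, while the DataNodes store the blocks themselves.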

NameNode
o It is the single master server in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming, and
closing files.
o Its single-master design simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker
o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by
using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.

Task Tracker
o It works as a slave node for the JobTracker.
o It receives tasks and code from the JobTracker and applies that code to the file. This process
can also be called a Mapper.

MapReduce Layer: The MapReduce layer comes into play when the client application submits a
MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate
TaskTrackers. Sometimes a TaskTracker fails or times out; in that case, that part of the job is
rescheduled.

Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, reducing
processing time. Hadoop can process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost-effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or
some other network failure happens, Hadoop uses another copy of the data. By default, data is
replicated three times, but the replication factor is configurable.

History of Hadoop: Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin
was the Google File System paper, published by Google.

9. What is open source technology?


Open source technology means that the source code is freely available to use, modify, and redistribute.
Linux: Much of the success behind Linux comes from its pairing with GNU software, as GNU/Linux.
Linus Torvalds announced that he was creating an OS kernel inspired by Minix back in 1991. The
majority of web servers run Linux; Ubuntu has started to make inroads into the desktop market; and
Linux is a strong player in the mobile market with Android (which uses the Linux kernel).

Ubuntu: Launched in 2004, Ubuntu is the most popular Linux distribution in today's world,
especially on the desktop. Ubuntu is an open-source software platform that runs almost
everywhere, from IoT devices, smartphones, tablets, and PCs to the server and the cloud.
MySQL
 MySQL is the most widely used and highly popular open-source relational SQL database
management system for database servers in the world.
 It is used by popular websites and services including Wikipedia and Facebook, and it is the
M in the LAMP stack (Linux, Apache, MySQL, and PHP).
 MySQL is one of the best RDBMSs for developing web-based software applications.

Apache: The Apache HTTP Server has been the most popular web server software since 1996,
shortly after it got started. Apache still holds a strong lead over Microsoft IIS in terms of the number
of deployed websites (according to Netcraft, Apache is currently used by 46% of all websites, while
IIS is used by 29%). In 2009 it became the first web server to be used by more than 100 million
websites.

Java: Java is a high-level programming language originally developed by Sun Microsystems and
has powered enterprise applications since 1995. Java runs on more than 800 million PCs, more than
two billion handheld devices, and 3.5 billion smart cards, and is implemented in a host of set-top
boxes, webcams, games, medical devices, and much more. Major companies such as Oracle (Java's
owner) and IBM consider Java a technology to watch and embrace for all levels of enterprise use.
Why going for Open Source technology?
Seeing the sudden boom in open-source technologies, you may wonder why companies are
increasingly adopting them instead of proprietary ones. Here are the reasons:

 Cutting down cost - Though companies invest heavily in acquiring the latest and most
advanced technologies, saving cost has always remained a concern. Open-source technologies
relieve firms from spending money for this purpose: they are available free of cost and ready
to be deployed, which saves a significant amount of money.
 Quality improvement - With open source, the code can be customized or modified as and
when required, which is not possible with proprietary software. Whether it is bug removal or
a modification, changing the code improves the quality of the technology.
 Agile business - With the increasing speed of data, it has become imperative for companies to
be instantly responsive. Adopting open-source technologies is one way of achieving that: such
software speeds up the development process, as it is quicker to debug existing code than to
write new code.
 Mitigate business risk - Technology is changing rapidly, making it hard for companies to rely
on a single vendor or multiple vendors of software. Using an open-source software stack is one
solution: its transparency allows companies to build a vibrant community, which in turn
lessens business risk.

10. What do you mean by mobile business intelligence?


Mobile business intelligence is a technology-enabled process of extracting meaningful insights from
data and delivering them to end-users via mobile devices. Mobile BI users can conduct data analysis in
real time using smartphones, tablets, and wearables to make quick data-driven decisions.
11. What is a mobile example?
Typical examples include smartphones, tablets, laptop computers, smart watches, e-readers, and
handheld gaming consoles. We can expect a few new mobile trends to evolve and grow over time such
as progressive web apps, mobile-first websites, and AI-based services.

12. How to create mobile business?


The steps to starting a mobile business are the same as starting any type of business.
1. Develop a business plan.
2. Define a budget.
3. Pick a business name.
4. Get a virtual office.
5. Apply for the necessary permits and licenses.
6. Register your business.
7. Get a vehicle.
8. Hire employees if necessary.

13. What is crowdsourcing analytics in big data?


By crowdsourcing data, businesses can receive feedback and input from a large number of users
in a short amount of time. This can be used to improve products and make them more user-friendly.
Additionally, data crowdsourcing can be used to understand customer sentiment and track product
performance.
14. What Is Crowdsourcing?
Crowdsourcing involves obtaining work, information, or opinions from a large group of
people who submit their data via the Internet, social media, and smartphone apps.

15. What Are the Main Types of Crowdsourcing?


Crowdsourcing involves obtaining information or resources from a wide swath of people. In
general, we can break this up into four main categories:

 Wisdom - Wisdom of crowds is the idea that large groups of people are collectively smarter
than individual experts when it comes to problem-solving or identifying values (like the
weight of a cow or number of jelly beans in a jar).
 Creation - Crowd creation is a collaborative effort to design or build something. Wikipedia
and other wikis are examples of this. Open-source software is another good example.
 Voting - Crowd voting uses the democratic principle to choose a particular policy or course of
action by "polling the audience."
 Funding - Crowdfunding involves raising money for various purposes by soliciting relatively
small amounts from a large number of funders.
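The wisdom-of-crowds idea above can be illustrated with a small sketch. The guesses below are hypothetical; the point is that independent individual errors tend to cancel out in the average:

```python
# Hypothetical guesses (in kg) about a cow's weight from ten people.
guesses = [540, 610, 575, 590, 560, 605, 570, 585, 595, 570]
true_weight = 580

# The crowd's estimate is the simple average of all guesses.
crowd_estimate = sum(guesses) / len(guesses)

# How close did the single best guesser get?
best_individual_error = min(abs(g - true_weight) for g in guesses)

print(crowd_estimate)  # 580.0 -- the average lands on the true value
```

Here the crowd's average is at least as accurate as the best individual guess, which is the effect the "weight of a cow" and "jelly beans in a jar" examples describe.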

16. What do you mean by inter- and trans-firewall analytics?
Inter-firewall analytics
 Focus: Analyzes traffic flows between different firewalls within a network.
 Methodology: Utilizes data collected from multiple firewalls to identify anomalies and potential
breaches.
 Benefits: Provides a comprehensive view of network traffic flow and helps identify lateral
movement across different security zones.
 Limitations: Requires deployment of multiple firewalls within the network and efficient data
exchange mechanisms between them.

Trans-firewall analytics
 Focus: Analyzes encrypted traffic that traverses firewalls, which traditional security solutions may
not be able to decrypt and inspect.
 Methodology: Uses deep packet inspection (DPI) and other advanced techniques to analyze the
content of encrypted traffic without compromising its security.
 Benefits: Provides insight into previously hidden threats within encrypted traffic and helps detect
sophisticated attacks.
 Limitations: Requires specialized hardware and software solutions for DPI, and raises concerns
regarding potential data privacy violations.
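A minimal sketch of the inter-firewall idea, using hypothetical log entries: correlate events collected from several firewalls and flag source IPs observed crossing many security zones, a crude signal of lateral movement.

```python
from collections import defaultdict

# Hypothetical log entries gathered from several firewalls: (security_zone, source_ip).
events = [
    ("dmz", "10.0.0.5"),
    ("internal", "10.0.0.5"),
    ("database", "10.0.0.5"),
    ("dmz", "10.0.0.9"),
]

# Build the set of zones in which each source IP has been seen.
zones_by_ip = defaultdict(set)
for zone, ip in events:
    zones_by_ip[ip].add(zone)

# Flag any source IP observed in more than two zones.
suspicious = [ip for ip, zones in zones_by_ip.items() if len(zones) > 2]
print(suspicious)  # ['10.0.0.5']
```

Real inter-firewall analytics would also correlate timestamps, ports, and session data, but the core step is the same: aggregating and cross-referencing records from multiple firewalls rather than inspecting each in isolation.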
Difference between inter- and trans-firewall analytics:

Feature     | Inter-Firewall Analytics                                            | Trans-Firewall Analytics
Focus       | Network traffic flow between firewalls                              | Content of encrypted traffic
Methodology | Analyzes data from multiple firewalls                               | Uses DPI and other techniques to analyze encrypted traffic
Benefits    | Comprehensive view of network traffic, identifies lateral movement | Detects threats within encrypted traffic
Limitations | Requires multiple firewalls and efficient data exchange             | Requires specialized hardware and software, raises privacy concerns
