IOTBDM - Mid Sem

What is Big Data and its features?

Big Data refers to extremely large datasets that are difficult to process using traditional data processing tools. These
datasets are characterized by their volume, velocity, variety, veracity, and value.

Features of Big Data analytics:

1. Scalability: Big Data analytics tools and platforms must be able to handle large datasets and scale to meet
growing data volumes.
2. Flexibility: They must be able to handle a variety of data types, including structured, semi-structured, and
unstructured data.
3. Performance: Big Data analytics tools must be able to process data quickly and efficiently, even when
dealing with large datasets.
4. Accuracy: The results of Big Data analytics must be accurate and reliable.
5. Integration: Big Data analytics tools should be able to integrate with other enterprise applications and
systems.
6. Usability: They should be easy to use, even for users who are not data scientists.
7. Visualization: Big Data analytics tools should be able to visualize data in a way that is easy to understand
and interpret.
8. Predictive analytics: Big Data analytics can be used to predict future trends and outcomes.
9. Prescriptive analytics: Big Data analytics can be used to recommend specific actions based on data
analysis.
10. Real-time analytics: Big Data analytics can be used to process and analyze data in real time.

The "5 Vs" of big data are a framework

1. Volume: This refers to the sheer amount of data generated. Big data sets are typically very
large, often measured in terabytes or petabytes.
2. Velocity: This refers to the speed at which data is generated and processed. Big data sets
are often generated in real time or near-real time, requiring systems that can handle high data
throughput.
3. Variety: This refers to the different types of data that are included in a big data set. Big data
sets can include structured data (like data in databases), semi-structured data (like XML or
JSON files), and unstructured data (like text documents, images, or audio files).
4. Veracity: This refers to the quality and accuracy of the data. Big data sets can be noisy or
incomplete, making it difficult to extract meaningful insights.
5. Value (the worth of the data): This refers to the potential value that can be extracted from the
data. Big data sets can provide valuable insights into business operations, customer behavior,
and other areas.
Four main types of data analytics
1. Descriptive Analytics: This is the most basic type of data analysis, providing a summary of past data.
It helps to understand what has happened in the past by answering questions like "What happened?" and
"What is the current situation?" Examples of descriptive analytics include creating reports, dashboards,
and visualizations.

2. Diagnostic Analytics: This type of analysis digs deeper into the data to understand why something
happened. It helps to identify the root causes of problems or successes by answering questions like "Why
did this happen?" and "What factors contributed to this outcome?" Examples of diagnostic analytics
include trend analysis, correlation analysis, and data mining.

3. Predictive Analytics: This type of analysis uses historical data to predict future outcomes. It helps to
anticipate future trends and make informed decisions by answering questions like "What will happen?"
and "What is the probability of this event occurring?" Examples of predictive analytics include forecasting,
time series analysis, and machine learning.

4. Prescriptive Analytics: This is the most advanced type of data analysis, providing recommendations
for future actions based on past data and predictions. It helps to optimize decision-making by answering
questions like "What should we do?" and "What is the best course of action?" Examples of prescriptive
analytics include optimization, simulation, and decision automation.

How IoT and big data work together

1. Data Generation: IoT devices constantly generate vast amounts of data, including sensor
readings, location information, and other relevant metrics. This data is often unstructured or
semi-structured, making it difficult to analyze using traditional data processing methods.

2. Data Collection: IoT platforms and gateways collect this data from the devices and store it in
data repositories (for example, a Hadoop-based architecture). These repositories can be
cloud-based or on-premises, depending on the specific requirements of the application.

3. Data Processing: Big data analytics tools and techniques are used to process and analyze the
collected data. These tools can handle large volumes of data and extract valuable insights that
would be impossible to obtain using traditional methods.

4. Data Analysis: By analyzing the data, organizations can gain insights into various aspects of
their operations, such as customer behavior, equipment performance, and supply chain efficiency.
This information can be used to make data-driven decisions and improve overall performance.
● Descriptive Analytics
● Diagnostic Analytics
● Predictive Analytics
● Prescriptive Analytics

5. Data Reporting: The resulting insights are presented to stakeholders as reports, dashboards, and visualizations.
Applications of IoT and Big Data Management
(IoTBDM) in Different Sectors
Manufacturing

● Predictive Maintenance: IoT sensors can monitor equipment health in real time, predicting
failures before they occur. This reduces downtime and maintenance costs.
● Quality Control: IoT devices can track product quality throughout the manufacturing process,
ensuring compliance with standards.
● Supply Chain Optimization: IoT can track inventory levels, optimize transportation routes, and
improve supply chain visibility.

Healthcare

● Remote Patient Monitoring: IoT devices can monitor patients' vital signs, enabling remote care
and early detection of health issues.
● Healthcare Analytics: Big data analysis can identify trends in patient data, improve treatment
outcomes, and optimize resource allocation.
● Smart Hospitals: IoT-enabled infrastructure can optimize energy consumption, improve patient
safety, and enhance operational efficiency.

Agriculture

● Precision Agriculture: IoT sensors can monitor soil moisture, temperature, and other
environmental factors, enabling farmers to optimize resource usage and improve crop yields.
● Livestock Monitoring: IoT devices can track livestock health, location, and behavior, improving
animal welfare and productivity.
● Supply Chain Management: IoT can track food products from farm to table, ensuring food safety
and traceability.

Transportation

● Smart Cities: IoT-enabled infrastructure can optimize traffic flow, improve public transportation,
and reduce congestion.
● Autonomous Vehicles: IoT sensors and data analytics can enable autonomous vehicles to
navigate safely and efficiently.
● Fleet Management: IoT can track vehicle location, fuel consumption, and maintenance,
improving fleet efficiency and reducing costs.

Retail

● Inventory Management: IoT can track inventory levels in real time, preventing stockouts and
overstocking.
● Customer Analytics: IoT devices can collect data on customer behavior, preferences, and
in-store experiences, enabling retailers to personalize marketing and improve customer
satisfaction.
https://youtu.be/8r7kHT4K1pA?si=5P-IRBw0S4w_kb4s

HDFS
HDFS (Hadoop Distributed File System) is the storage system used by Hadoop to store large datasets across
multiple nodes in a distributed environment. It is designed to handle large amounts of data with high fault
tolerance, scalability, and performance. Here's an overview of its key components and how the architecture
works:

Core Components of HDFS Architecture:


1. NameNode:
○ The master node responsible for managing the metadata of the file system, such as the
directory structure and locations of blocks on the DataNodes.
○ It maintains information like:
■ File names and directories.
■ Block locations.
■ Permissions, ownership, and other file attributes.
○ The NameNode does not store actual data but keeps track of where data blocks are stored
across the cluster.
2. DataNode:
○ Slave nodes that store the actual data blocks on the Hadoop cluster.
○ Each file in HDFS is split into blocks, and each block is stored across multiple DataNodes.
○ DataNodes periodically send heartbeats and block reports to the NameNode to inform it that
they are alive and to provide the status of their data blocks.
○ If a DataNode fails, the NameNode can replicate blocks to maintain data availability.
3. Secondary NameNode:
○ Not a backup for the NameNode, but instead helps the NameNode manage its metadata and
prevents it from becoming overloaded.
○ Periodically merges the edit logs (transaction logs of metadata changes) with the fsimage
(snapshot of the file system metadata) to create a new fsimage, which reduces the size of the
edit logs.
○ If the NameNode fails, the Secondary NameNode helps in recovery by providing a recent
copy of the metadata.
Fig: HDFS architecture

Imagine HDFS as a giant library.

● The NameNode is like the librarian who keeps track of all the books and where they're
located.
● The fsimage is like a list of all the books in the library.
● The edits log is like a notebook where the librarian writes down when new books arrive,
old books are removed, or books are moved to different shelves.

The NameNode and Secondary NameNode work together to make sure the library (HDFS) is
always organized and up-to-date. They keep a list of all the books (fsimage) and write down any
changes to the list (edits log). If the main librarian (NameNode) gets sick, the assistant librarian's
(Secondary NameNode's) most recent copy of the list and notes can be used to rebuild the
catalogue and get the library running again.

fsimage is like a snapshot of the library (HDFS) at a particular point in time. It contains information
about all the books (files and directories) in the library and where they are located.

edits log is like a journal where changes to the library are recorded. Whenever a new book is
added, an old book is removed, or a book is moved to a different shelf, it is written down in the edits
log.
How fsimage and the edits log work together:

1. Edits Log Updates: Whenever a change is made to the library (a book is added, removed,
or moved), the NameNode writes it down in the edits log.
2. Synchronization: The assistant librarian (Secondary NameNode) regularly fetches the
current list of books (fsimage) and the notebook of changes (edits log) from the main
librarian (NameNode).
3. Periodic Checkpointing: The Secondary NameNode merges the edits into the fsimage to
produce a fresh, up-to-date fsimage and returns it, so the NameNode can truncate its edits log.
4. Recovery: If the main librarian (NameNode) fails, the most recent fsimage plus any remaining
edits can be used to rebuild the catalogue; the Secondary NameNode is a checkpoint helper,
not an automatic standby.

Benefits of using fsimage and edits log:

● Data Consistency: Ensures that the library (HDFS) is always organized and up-to-date.
● Fault Tolerance: If the main librarian (NameNode) fails, the checkpointed metadata held by the
Secondary NameNode can be used to restore the namespace without losing the recorded changes.
● Efficiency: By periodically creating fsimages, the NameNode can reduce the amount of
work it has to do to keep track of all the books.

Rack awareness in HDFS is like knowing where the books are located in a library. It helps the
librarian (NameNode) place the books (data) on shelves (DataNodes) that are close together to
make it easier and faster for people (clients) to find the books they need.

Rack awareness in HDFS (Hadoop Distributed File System) is a feature that allows the system to be
aware of the physical location of DataNodes within a cluster. This information is used to optimize
data placement and network traffic, improving the overall performance and efficiency of the HDFS
cluster.

How rack awareness works:

1. Rack Identification: Each DataNode is assigned a rack ID that identifies the physical rack
where it is located.
2. Rack Topology: The NameNode maintains a map of the rack topology of the cluster, which
includes information about the racks and the DataNodes located on each rack.
3. Data Placement: When placing data blocks, the NameNode considers the rack topology to
ensure that blocks are replicated across multiple racks. This helps to reduce the risk of data
loss due to a single rack failure and improves network performance.
4. Read Optimization: When reading data, the NameNode chooses DataNodes that are on the
same rack as the client or on a different rack that is connected by a high-speed network link.
This reduces the amount of network traffic and improves read performance.
Benefits of rack awareness:

● Improved Data Availability: By replicating data across multiple racks, rack awareness
helps to ensure that data is available even if a single rack fails.
● Reduced Network Traffic: Rack awareness can reduce network traffic by placing data
blocks on DataNodes that are close to the client.
● Improved Performance: Rack awareness can improve the overall performance of HDFS by
optimizing data placement and network traffic.
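The default placement rule described above (first replica near the writer, the next two together on a
different rack) can be illustrated with a small toy function. This is a sketch of the idea only, not
HDFS's actual BlockPlacementPolicy; the rack names and DataNode IDs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch of the default replica placement heuristic for replication = 3:
// replica 1 on the writer's rack, replicas 2 and 3 together on a remote rack,
// so a single rack failure cannot destroy every copy of a block.
public class RackPlacementSketch {

    static List<String> placeReplicas(String clientRack,
                                      Map<String, List<String>> rackToNodes) {
        List<String> chosen = new ArrayList<>();

        // First replica: a node on the same rack as the client/writer.
        chosen.add(rackToNodes.get(clientRack).get(0));

        // Second and third replicas: two different nodes on one remote rack.
        for (Map.Entry<String, List<String>> e : rackToNodes.entrySet()) {
            if (!e.getKey().equals(clientRack) && e.getValue().size() >= 2) {
                chosen.add(e.getValue().get(0));
                chosen.add(e.getValue().get(1));
                break;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = Map.of(
                "/rack1", List.of("dn1", "dn2"),
                "/rack2", List.of("dn3", "dn4"));
        System.out.println(placeReplicas("/rack1", topology)); // e.g. [dn1, dn3, dn4]
    }
}
```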

Read and Write Operations in HDFS


In HDFS (Hadoop Distributed File System), read and write operations are the fundamental
methods for interacting with data stored on the cluster; a minimal client-side code sketch
follows the two outlines below.

Read Operation:
1. Client Request: A client application initiates a read operation by specifying the file or
directory to be read.
2. NameNode Lookup: The NameNode determines the location of the data blocks that
make up the requested file.
3. Data Retrieval: The client directly reads the data blocks from the DataNodes that store
them.
4. Data Assembly: The client assembles the retrieved data blocks into the requested file.

Write Operation:
1. Client Request: A client application initiates a write operation by specifying the file or
directory to be written to.
2. NameNode Coordination: The NameNode determines the appropriate DataNodes to
store the data based on factors like replication policy and network topology.
3. Data Transmission: The client transmits the data to the designated DataNodes.
4. Data Replication: The DataNodes replicate the data to ensure redundancy and fault
tolerance.
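As a minimal client-side sketch of the two operations above, the code below uses the Hadoop
FileSystem API to write a small file and read it back. The NameNode address and file path are
placeholder assumptions; in practice fs.defaultFS comes from core-site.xml, and error handling is
omitted for brevity.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/example.txt");

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the bytes to them through this output stream.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode,
        // then pulls the bytes directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```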
Functions of NameNode, DataNode, and Secondary NameNode in HDFS:

NameNode

● Master Control: The NameNode is the central authority that manages the HDFS
namespace.
● Bookkeeping: It keeps track of all the files, directories, and data blocks in the system.
● Placement: It decides where data blocks should be stored on DataNodes.
● Client Requests: Handles requests from clients to read or write data.

DataNode

● Storage: DataNodes are like shelves that store data blocks.


● Replication: They make copies of data blocks to prevent loss.
● Retrieval: They send data blocks to clients when requested.
● Reporting: They tell the NameNode about their status and the data they're holding.

Secondary NameNode

● Checkpoint Helper: It's like an assistant librarian who helps the main librarian (NameNode); it is
not a hot standby.
● Checkpoints: It periodically merges the edits log into the fsimage, keeping a recent copy of the
NameNode's metadata to guard against loss.
● Recovery Aid: If the main librarian (NameNode) fails, its latest checkpoint can be used to rebuild
the metadata.
● Load Reduction: It offloads the checkpoint-merging work the NameNode would otherwise have to
do itself.
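The NameNode's bookkeeping role can be seen from the client side by asking it for a file's block
locations, as in the hedged sketch below. The file path is a placeholder and the configuration is
assumed to come from core-site.xml on the classpath; the call returns metadata only, and no block
data is read from the DataNodes.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured via core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());

        // The NameNode answers this metadata query from its namespace records.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is replicated on several DataNodes (the "hosts").
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " datanodes=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```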

What is MapReduce?


MapReduce is a programming model used to process large datasets across distributed systems.
It's like dividing a big task into smaller, manageable pieces and then combining the results.

How it works:

1. Splitting: The big task (data) is divided into smaller pieces (chunks).
2. Mapping: Each piece is processed by a map function, which turns it into smaller pieces
(key-value pairs).
3. Shuffling & Sorting: The smaller pieces are sorted and grouped together.
4. Reducing: A reduce function combines the grouped pieces into a final result.

In more detail, MapReduce consists of two main phases:
Map Phase
● Data Splitting: The input data is divided into smaller chunks.
● Map Function: Each chunk is processed by a map function, which transforms the data into
key-value pairs.
● Intermediate Data: The map functions produce intermediate key-value pairs.

Example:

● Input: A large text file.


● Map Function: Splits the text into words and emits a key-value pair for each word, where the key is
the word and the value is 1.

Reduce Phase
● Shuffle and Sort: The intermediate key-value pairs are shuffled and sorted based on their keys.
● Reduce Function: The sorted key-value pairs are grouped by key, and a reduce function is applied
to each group to combine the values.
● Final Output: The reduce phase produces the final output of the MapReduce job.

Example:

● Reduce Function: Combines the values for each word to count the total occurrences of that word.
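Putting the map and reduce examples above together, here is the canonical word-count job written
against the Hadoop MapReduce API. Input and output paths are passed as command-line arguments
(placeholders here); the combiner is an optional optimization that pre-aggregates counts on the map
side.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle/sort groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional map-side aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical run packages this class into a jar and submits it to the cluster with the hadoop jar command,
with the input directory already present on HDFS.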

Benefits of MapReduce:

● Scalability: Can handle massive datasets.


● Fault Tolerance: Resilient to hardware failures.
● Ease of Use: Provides a simple programming model for parallel processing.
● Flexibility: Can be used for a variety of data processing tasks.

Common Use Cases:

● Word Counting: Counting the frequency of words in a large text corpus.


● PageRank: Calculating the importance of web pages.
● Recommendation Systems: Suggesting products or services to users based on their preferences.
● Machine Learning: Training machine learning models on large datasets.
JobTracker and TaskTracker are key components of the Hadoop
MapReduce framework.

JobTracker
● Master Node: The JobTracker is the master node responsible for coordinating MapReduce jobs.
● Job Scheduling: It schedules Map and Reduce tasks across the cluster.
● Resource Allocation: It allocates resources (CPU, memory, etc.) to tasks.
● Task Monitoring: It monitors the progress of tasks and handles failures.

TaskTracker
● Slave Node: TaskTrackers are slave nodes that execute Map and Reduce tasks.
● Task Execution: They execute individual Map and Reduce tasks assigned by the JobTracker.
● Status Reporting: They report the status of their tasks to the JobTracker.
● Resource Management: They manage resources (CPU, memory) on their local nodes.

Relationship between JobTracker and TaskTracker:

● The JobTracker assigns tasks to TaskTrackers based on their availability and resource requirements.
● TaskTrackers execute tasks and report their status to the JobTracker.
● If a TaskTracker fails, the JobTracker can reschedule the failed tasks on other TaskTrackers.
MapReduce is a programming model for processing large datasets in a parallel and distributed manner. It
consists of two main phases:

1. Map Phase: In this phase, the input data is divided into smaller chunks, and a mapping
function is applied to each chunk independently. The mapping function transforms the input
data into a set of key-value pairs.

2. Reduce Phase: In this phase, the output from the mapping phase is grouped by the keys,
and a reduce function is applied to the values associated with each key. The reduce function
combines the values and produces the final output.

Here are some real-time examples of how MapReduce can be used:

1. Web Crawler and Search Engine Indexing: When a web crawler collects web pages, the
MapReduce model can be used to process the crawled data and build an index for a search
engine. The map phase can extract the text and metadata from each web page, while the
reduce phase can aggregate the information and build the search index.
2. Log File Analysis: Large organizations often have massive amounts of log data generated
by their systems and applications. MapReduce can be used to analyze these logs and
extract valuable insights, such as identifying errors, tracking user behavior, or generating
usage reports.
3. Sentiment Analysis: MapReduce can be used to analyze social media data, such as
tweets or Facebook posts, to determine the sentiment (positive, negative, or neutral) of the
content. The map phase can extract the text and metadata, while the reduce phase can
aggregate the sentiment scores for each entity or topic.
4. Recommendation Systems: E-commerce websites and streaming platforms use
recommendation systems to suggest products or content to users based on their browsing
history and preferences. MapReduce can be used to process large amounts of user data,
create user profiles, and generate personalized recommendations.

5. Financial Data Analysis: Banks and financial institutions often need to analyze large
datasets, such as stock trades, transactions, or market data. MapReduce can be used to
process this data, detect patterns, and identify trends that can inform investment decisions
or risk management strategies.

The key advantage of MapReduce is its ability to scale and process large datasets efficiently by distributing the
workload across multiple machines. This makes it well-suited for handling massive amounts of data in a variety
of real-time applications.
YARN
What is YARN?

YARN (Yet Another Resource Negotiator) is a core component of Hadoop introduced in Hadoop 2.x to
manage resources in a distributed computing environment. It functions as a resource management layer
for the Hadoop cluster, allowing multiple applications to share resources while improving the scalability
and efficiency of data processing.

Need for YARN:

YARN was introduced to overcome the limitations of Hadoop’s original MapReduce framework, which
tightly coupled resource management and job execution. Some key needs for YARN are:

1. Scalability: As data sizes grew, there was a need for a more efficient resource management
system.
2. Resource Utilization: The earlier versions of Hadoop lacked fine-grained control over resources,
leading to underutilization.
3. Multi-tenancy: YARN enables multiple data-processing frameworks (not just MapReduce) to run
on a single cluster, making it more flexible.
4. Decoupling of Resource Management and Scheduling: YARN separates resource
management and job scheduling, allowing better control and scalability.

Advantages of YARN:

1. Scalability: YARN allows Hadoop to scale to larger clusters by efficiently managing resources
across different nodes.
2. Resource Efficiency: It optimizes the use of cluster resources by dynamically allocating
resources to tasks based on the needs of applications.
3. Flexibility: YARN supports various data-processing frameworks (MapReduce, Spark, Flink),
making Hadoop more versatile.
4. Multi-tenancy: Multiple applications can share the cluster resources without interfering with each
other.
5. Improved Performance: By decoupling resource management from job scheduling, YARN
provides a more efficient way to manage workloads.
6. Fault Tolerance: YARN improves fault tolerance by isolating resource management and
application execution, so failures in one don’t affect the other.
Components of YARN Architecture:

1. Resource Manager (RM):


○ Central authority that manages and allocates resources across the entire Hadoop
cluster.
○ It has two key components:
■ Scheduler:
■ Allocates resources to various running applications based on
resource requirements.
■ Ensures fairness and capacity but does not monitor or track job
statuses.
■ Applications Manager:
■ Accepts application submissions (such as MapReduce or
Spark jobs).
■ Negotiates the first container for each application's
Application Master and restarts that container if it fails.
2. Node Manager (NM):
○ Per-node agent responsible for managing individual nodes within the cluster.
○ Manages the lifecycle of containers (which are resource bundles like CPU and
memory) on its node.
○ Monitors resource usage (CPU, memory, disk) and reports it to the Resource
Manager.
○ Responsible for launching containers and killing them when tasks are done or fail.
3. Application Master (AM):
○ Per-application agent that negotiates resources with the Resource Manager.
○ Manages the execution of tasks within the application by requesting containers
from the Resource Manager.
○ Each application (e.g., a MapReduce or Spark job) has its own dedicated
Application Master.
○ Coordinates between the Node Managers to execute tasks and monitors the
application's progress.
○ It is short-lived, running only for the duration of the application it is managing.
4. Containers:
○ Execution units that are allocated by the Resource Manager and managed by the
Node Manager.
○ Containers are bundles of resources (CPU, memory, disk, network) and are where
individual tasks run.
○ Tasks such as MapReduce jobs or Spark operations are executed within
containers.
YARN Workflow:
1. Application Submission:
○ A client submits an application (e.g., a MapReduce job) to the Resource Manager.
2. Resource Allocation for Application Master:
○ The Resource Manager allocates the first container for the Application Master of that
job.
3. Application Master Negotiation:
○ The Application Master negotiates with the Resource Manager for additional containers
where the job tasks will be executed.
4. Task Execution:
○ The Node Manager on each node manages containers and runs the tasks inside these
containers, following the instructions from the Application Master.
5. Job Monitoring and Completion:
○ The Application Master monitors the execution of tasks, handles retries or failures, and
reports back to the Resource Manager once the application completes.
6. Resource Release:
○ Once the job is finished, the Application Master notifies the Resource Manager, and all
allocated containers and resources are released back to the pool for other jobs.
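The workflow above can also be observed from a client program. The hedged sketch below uses the
YarnClient API to ask the Resource Manager for its running Node Managers and applications; it
assumes a reachable cluster whose settings are available on the classpath via yarn-site.xml.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfoSketch {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager using settings from yarn-site.xml
        // on the classpath (cluster address is an environment assumption).
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The Resource Manager knows every Node Manager in the cluster ...
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }

        // ... and every application, each with its own Application Master.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```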
