IOTBDM - Mid Sem

What is Big Data and its features?

Big Data refers to extremely large datasets that are difficult to process using traditional data processing tools. These
datasets are characterized by their volume, velocity, variety, veracity, and value.

Features of Big Data analytics:

1. Scalability: Big Data analytics tools and platforms must be able to handle large datasets and scale to meet
growing data volumes.
2. Flexibility: They must be able to handle a variety of data types, including structured, semi-structured, and
unstructured data.
3. Performance: Big Data analytics tools must be able to process data quickly and efficiently, even when
dealing with large datasets.
4. Accuracy: The results of Big Data analytics must be accurate and reliable.
5. Integration: Big Data analytics tools should be able to integrate with other enterprise applications and
systems.
6. Usability: They should be easy to use, even for users who are not data scientists.
7. Visualization: Big Data analytics tools should be able to visualize data in a way that is easy to understand
and interpret.
8. Predictive analytics: Big Data analytics can be used to predict future trends and outcomes.
9. Prescriptive analytics: Big Data analytics can be used to recommend specific actions based on data
analysis.
10. Real-time analytics: Big Data analytics can be used to process and analyze data in real time.

The "5 Vs" of big data are a framework

1. Volume: This refers to the sheer amount of data generated. Big data sets are typically very
large, often measured in terabytes or petabytes.
2. Velocity: This refers to the speed at which data is generated and processed. Big data sets
are often generated in real time or near-real time, requiring systems that can handle high data
throughput.
3. Variety: This refers to the different types of data that are included in a big data set. Big data
sets can include structured data (like data in databases), semi-structured data (like XML or
JSON files), and unstructured data (like text documents, images, or audio files).
4. Veracity: This refers to the quality and accuracy of the data. Big data sets can be noisy or
incomplete, making it difficult to extract meaningful insights.
5. Value (the worth of the data): This refers to the potential value that can be extracted from the
data. Big data sets can provide valuable insights into business operations, customer behavior,
and other areas.
Four main types of data analytics
1. Descriptive Analytics: This is the most basic type of data analysis, providing a summary of past data.
It helps to understand what has happened in the past by answering questions like "What happened?" and
"What is the current situation?" Examples of descriptive analytics include creating reports, dashboards,
and visualizations.

2. Diagnostic Analytics: This type of analysis digs deeper into the data to understand why something
happened. It helps to identify the root causes of problems or successes by answering questions like "Why
did this happen?" and "What factors contributed to this outcome?" Examples of diagnostic analytics
include trend analysis, correlation analysis, and data mining.

3. Predictive Analytics: This type of analysis uses historical data to predict future outcomes. It helps to
anticipate future trends and make informed decisions by answering questions like "What will happen?"
and "What is the probability of this event occurring?" Examples of predictive analytics include forecasting,
time series analysis, and machine learning.

4. Prescriptive Analytics: This is the most advanced type of data analysis, providing recommendations
for future actions based on past data and predictions. It helps to optimize decision-making by answering
questions like "What should we do?" and "What is the best course of action?" Examples of prescriptive
analytics include optimization, simulation, and decision automation.

How IoT and big data work together

1. Data Generation: IoT devices constantly generate vast amounts of data, including sensor
readings, location information, and other relevant metrics. This data is often unstructured or
semi-structured, making it difficult to analyze using traditional data processing methods.

2. Data Collection: IoT platforms and gateways collect this data from the devices and store it in
data repositories (for example, a Hadoop-based architecture). These repositories can be
cloud-based or on-premises, depending on the specific requirements of the application.

3. Data Processing: Big data analytics tools and techniques are used to process and analyze the
collected data. These tools can handle large volumes of data and extract valuable insights that
would be impossible to obtain using traditional methods.

4. Data Analysis: By analyzing the data, organizations can gain insights into various aspects of
their operations, such as customer behavior, equipment performance, and supply chain efficiency.
This information can be used to make data-driven decisions and improve overall performance.
● Descriptive Analytics
● Diagnostic Analytics
● Predictive Analytics
● Prescriptive Analytics

5. Data Reporting: The resulting insights are presented to stakeholders as reports, dashboards, and visualizations.
Applications of IoT and Big Data Management
(IoTBDM) in Different Sectors
Manufacturing

● Predictive Maintenance: IoT sensors can monitor equipment health in real time, predicting
failures before they occur. This reduces downtime and maintenance costs.
● Quality Control: IoT devices can track product quality throughout the manufacturing process,
ensuring compliance with standards.
● Supply Chain Optimization: IoT can track inventory levels, optimize transportation routes, and
improve supply chain visibility.

Healthcare

● Remote Patient Monitoring: IoT devices can monitor patients' vital signs, enabling remote care
and early detection of health issues.
● Healthcare Analytics: Big data analysis can identify trends in patient data, improve treatment
outcomes, and optimize resource allocation.
● Smart Hospitals: IoT-enabled infrastructure can optimize energy consumption, improve patient
safety, and enhance operational efficiency.

Agriculture

● Precision Agriculture: IoT sensors can monitor soil moisture, temperature, and other
environmental factors, enabling farmers to optimize resource usage and improve crop yields.
● Livestock Monitoring: IoT devices can track livestock health, location, and behavior, improving
animal welfare and productivity.
● Supply Chain Management: IoT can track food products from farm to table, ensuring food safety
and traceability.

Transportation

● Smart Cities: IoT-enabled infrastructure can optimize traffic flow, improve public transportation,
and reduce congestion.
● Autonomous Vehicles: IoT sensors and data analytics can enable autonomous vehicles to
navigate safely and efficiently.
● Fleet Management: IoT can track vehicle location, fuel consumption, and maintenance,
improving fleet efficiency and reducing costs.

Retail

● Inventory Management: IoT can track inventory levels in real time, preventing stockouts and
overstocking.
● Customer Analytics: IoT devices can collect data on customer behavior, preferences, and
in-store experiences, enabling retailers to personalize marketing and improve customer
satisfaction.
https://youtu.be/8r7kHT4K1pA?si=5P-IRBw0S4w_kb4s

HDFS
HDFS (Hadoop Distributed File System) is the storage system used by Hadoop to store large datasets across
multiple nodes in a distributed environment. It is designed to handle large amounts of data with high fault
tolerance, scalability, and performance. Here's an overview of its key components and how the architecture
works:

Core Components of HDFS Architecture:


1. NameNode:
○ The master node responsible for managing the metadata of the file system, such as the
directory structure and locations of blocks on the DataNodes.
○ It maintains information like:
■ File names and directories.
■ Block locations.
■ Permissions, ownership, and other file attributes.
○ The NameNode does not store actual data but keeps track of where data blocks are stored
across the cluster.
2. DataNode:
○ Slave nodes that store the actual data blocks on the Hadoop cluster.
○ Each file in HDFS is split into blocks, and each block is stored across multiple DataNodes.
○ DataNodes periodically send heartbeats and block reports to the NameNode to inform it that
they are alive and to provide the status of their data blocks.
○ If a DataNode fails, the NameNode can replicate blocks to maintain data availability.
3. Secondary NameNode:
○ Not a backup for the NameNode, but instead helps the NameNode manage its metadata and
prevents it from becoming overloaded.
○ Periodically merges the edit logs (transaction logs of metadata changes) with the fsimage
(snapshot of the file system metadata) to create a new fsimage, which reduces the size of the
edit logs.
○ If the NameNode fails, the Secondary NameNode helps in recovery by providing a recent
copy of the metadata.
Fig: HDFS architecture

Imagine HDFS as a giant library.

● The NameNode is like the librarian who keeps track of all the books and where they're
located.
● The fsimage is like a list of all the books in the library.
● The edits log is like a notebook where the librarian writes down when new books arrive,
old books are removed, or books are moved to different shelves.

The NameNode and Secondary NameNode work together to make sure the library (HDFS) is
always organized and up-to-date. They keep a list of all the books (fsimage) and write down any
changes to the list (edits log). If the main librarian (NameNode) gets sick, the assistant librarian's
(Secondary NameNode's) most recent copy of the list and notes can be used to rebuild the
catalogue and get the library running again.

fsimage is like a snapshot of the library (HDFS) at a particular point in time. It contains information
about all the books (files and directories) in the library and where they are located.

edits log is like a journal where changes to the library are recorded. Whenever a new book is
added, an old book is removed, or a book is moved to a different shelf, it is written down in the edits
log.
How fsimage and the edits log work together:

1. Edits Log Updates: Whenever a change is made to the library (a book is added, removed,
or moved), the NameNode writes it down in the edits log.
2. Synchronization: The assistant librarian (Secondary NameNode) regularly fetches the
current list of books (fsimage) and the notebook of changes (edits log) from the main
librarian (NameNode).
3. Periodic Checkpointing: The Secondary NameNode merges the edits into the fsimage to
produce a fresh, up-to-date fsimage and returns it, so the NameNode can truncate its edits log.
4. Recovery: If the main librarian (NameNode) fails, the most recent fsimage plus any remaining
edits can be used to rebuild the catalogue; the Secondary NameNode is a checkpoint helper,
not an automatic standby.

Benefits of using fsimage and edits log:

● Data Consistency: Ensures that the library (HDFS) is always organized and up-to-date.
● Fault Tolerance: If the main librarian (NameNode) fails, the checkpointed metadata held by the
Secondary NameNode can be used to restore the namespace without losing the recorded changes.
● Efficiency: By periodically creating fsimages, the NameNode can reduce the amount of
work it has to do to keep track of all the books.

Rack awareness in HDFS is like knowing where the books are located in a library. It helps the
librarian (NameNode) place the books (data) on shelves (DataNodes) that are close together to
make it easier and faster for people (clients) to find the books they need.

Rack awareness in HDFS (Hadoop Distributed File System) is a feature that allows the system to be
aware of the physical location of DataNodes within a cluster. This information is used to optimize
data placement and network traffic, improving the overall performance and efficiency of the HDFS
cluster.

How rack awareness works:

1. Rack Identification: Each DataNode is assigned a rack ID that identifies the physical rack
where it is located.
2. Rack Topology: The NameNode maintains a map of the rack topology of the cluster, which
includes information about the racks and the DataNodes located on each rack.
3. Data Placement: When placing data blocks, the NameNode considers the rack topology to
ensure that blocks are replicated across multiple racks. This helps to reduce the risk of data
loss due to a single rack failure and improves network performance.
4. Read Optimization: When reading data, the NameNode chooses DataNodes that are on the
same rack as the client or on a different rack that is connected by a high-speed network link.
This reduces the amount of network traffic and improves read performance.
Benefits of rack awareness:

● Improved Data Availability: By replicating data across multiple racks, rack awareness
helps to ensure that data is available even if a single rack fails.
● Reduced Network Traffic: Rack awareness can reduce network traffic by placing data
blocks on DataNodes that are close to the client.
● Improved Performance: Rack awareness can improve the overall performance of HDFS by
optimizing data placement and network traffic.
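The default placement rule described above (first replica near the writer, the next two together on a
different rack) can be illustrated with a small toy function. This is a sketch of the idea only, not
HDFS's actual BlockPlacementPolicy; the rack names and DataNode IDs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch of the default replica placement heuristic for replication = 3:
// replica 1 on the writer's rack, replicas 2 and 3 together on a remote rack,
// so a single rack failure cannot destroy every copy of a block.
public class RackPlacementSketch {

    static List<String> placeReplicas(String clientRack,
                                      Map<String, List<String>> rackToNodes) {
        List<String> chosen = new ArrayList<>();

        // First replica: a node on the same rack as the client/writer.
        chosen.add(rackToNodes.get(clientRack).get(0));

        // Second and third replicas: two different nodes on one remote rack.
        for (Map.Entry<String, List<String>> e : rackToNodes.entrySet()) {
            if (!e.getKey().equals(clientRack) && e.getValue().size() >= 2) {
                chosen.add(e.getValue().get(0));
                chosen.add(e.getValue().get(1));
                break;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = Map.of(
                "/rack1", List.of("dn1", "dn2"),
                "/rack2", List.of("dn3", "dn4"));
        System.out.println(placeReplicas("/rack1", topology)); // e.g. [dn1, dn3, dn4]
    }
}
```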

Read and Write Operations in HDFS


In HDFS (Hadoop Distributed File System), read and write operations are the fundamental
methods for interacting with data stored on the cluster; a minimal client-side code sketch
follows the two outlines below.

Read Operation:
1. Client Request: A client application initiates a read operation by specifying the file or
directory to be read.
2. NameNode Lookup: The NameNode determines the location of the data blocks that
make up the requested file.
3. Data Retrieval: The client directly reads the data blocks from the DataNodes that store
them.
4. Data Assembly: The client assembles the retrieved data blocks into the requested file.

Write Operation:
1. Client Request: A client application initiates a write operation by specifying the file or
directory to be written to.
2. NameNode Coordination: The NameNode determines the appropriate DataNodes to
store the data based on factors like replication policy and network topology.
3. Data Transmission: The client transmits the data to the designated DataNodes.
4. Data Replication: The DataNodes replicate the data to ensure redundancy and fault
tolerance.
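As a minimal client-side sketch of the two operations above, the code below uses the Hadoop
FileSystem API to write a small file and read it back. The NameNode address and file path are
placeholder assumptions; in practice fs.defaultFS comes from core-site.xml, and error handling is
omitted for brevity.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/example.txt");

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the bytes to them through this output stream.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode,
        // then pulls the bytes directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```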
Functions of NameNode, DataNode, and Secondary NameNode in HDFS:

NameNode

● Master Control: The NameNode is the central authority that manages the HDFS
namespace.
● Bookkeeping: It keeps track of all the files, directories, and data blocks in the system.
● Placement: It decides where data blocks should be stored on DataNodes.
● Client Requests: Handles requests from clients to read or write data.

DataNode

● Storage: DataNodes are like shelves that store data blocks.


● Replication: They make copies of data blocks to prevent loss.
● Retrieval: They send data blocks to clients when requested.
● Reporting: They tell the NameNode about their status and the data they're holding.

Secondary NameNode

● Checkpoint Helper: It's like an assistant librarian who helps the main librarian (NameNode); it is
not a hot standby.
● Checkpoints: It periodically merges the edits log into the fsimage, keeping a recent copy of the
NameNode's metadata to guard against loss.
● Recovery Aid: If the main librarian (NameNode) fails, its latest checkpoint can be used to rebuild
the metadata.
● Load Reduction: It offloads the checkpoint-merging work the NameNode would otherwise have to
do itself.
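The NameNode's bookkeeping role can be seen from the client side by asking it for a file's block
locations, as in the hedged sketch below. The file path is a placeholder and the configuration is
assumed to come from core-site.xml on the classpath; the call returns metadata only, and no block
data is read from the DataNodes.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured via core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());

        // The NameNode answers this metadata query from its namespace records.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is replicated on several DataNodes (the "hosts").
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " datanodes=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```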

What is MapReduce?


MapReduce is a programming model used to process large datasets across distributed systems.
It's like dividing a big task into smaller, manageable pieces and then combining the results.

How it works:

1. Splitting: The big task (data) is divided into smaller pieces (chunks).
2. Mapping: Each piece is processed by a map function, which turns it into smaller pieces
(key-value pairs).
3. Shuffling & Sorting: The smaller pieces are sorted and grouped together.
4. Reducing: A reduce function combines the grouped pieces into a final result.

In more detail, MapReduce consists of two main phases:
Map Phase
● Data Splitting: The input data is divided into smaller chunks.
● Map Function: Each chunk is processed by a map function, which transforms the data into
key-value pairs.
● Intermediate Data: The map functions produce intermediate key-value pairs.

Example:

● Input: A large text file.


● Map Function: Splits the text into words and emits a key-value pair for each word, where the key is
the word and the value is 1.

Reduce Phase
● Shuffle and Sort: The intermediate key-value pairs are shuffled and sorted based on their keys.
● Reduce Function: The sorted key-value pairs are grouped by key, and a reduce function is applied
to each group to combine the values.
● Final Output: The reduce phase produces the final output of the MapReduce job.

Example:

● Reduce Function: Combines the values for each word to count the total occurrences of that word.
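Putting the map and reduce examples above together, here is the canonical word-count job written
against the Hadoop MapReduce API. Input and output paths are passed as command-line arguments
(placeholders here); the combiner is an optional optimization that pre-aggregates counts on the map
side.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle/sort groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional map-side aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical run packages this class into a jar and submits it to the cluster with the hadoop jar command,
with the input directory already present on HDFS.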

Benefits of MapReduce:

● Scalability: Can handle massive datasets.


● Fault Tolerance: Resilient to hardware failures.
● Ease of Use: Provides a simple programming model for parallel processing.
● Flexibility: Can be used for a variety of data processing tasks.

Common Use Cases:

● Word Counting: Counting the frequency of words in a large text corpus.


● PageRank: Calculating the importance of web pages.
● Recommendation Systems: Suggesting products or services to users based on their preferences.
● Machine Learning: Training machine learning models on large datasets.
JobTracker and TaskTracker are key components of the Hadoop
MapReduce framework.

JobTracker
● Master Node: The JobTracker is the master node responsible for coordinating MapReduce jobs.
● Job Scheduling: It schedules Map and Reduce tasks across the cluster.
● Resource Allocation: It allocates resources (CPU, memory, etc.) to tasks.
● Task Monitoring: It monitors the progress of tasks and handles failures.

TaskTracker
● Slave Node: TaskTrackers are slave nodes that execute Map and Reduce tasks.
● Task Execution: They execute individual Map and Reduce tasks assigned by the JobTracker.
● Status Reporting: They report the status of their tasks to the JobTracker.
● Resource Management: They manage resources (CPU, memory) on their local nodes.

Relationship between JobTracker and TaskTracker:

● The JobTracker assigns tasks to TaskTrackers based on their availability and resource requirements.
● TaskTrackers execute tasks and report their status to the JobTracker.
● If a TaskTracker fails, the JobTracker can reschedule the failed tasks on other TaskTrackers.
MapReduce is a programming model for processing large datasets in a parallel and distributed manner. It
consists of two main phases:

1. Map Phase: In this phase, the input data is divided into smaller chunks, and a mapping
function is applied to each chunk independently. The mapping function transforms the input
data into a set of key-value pairs.

2. Reduce Phase: In this phase, the output from the mapping phase is grouped by the keys,
and a reduce function is applied to the values associated with each key. The reduce function
combines the values and produces the final output.

Here are some real-time examples of how MapReduce can be used:

1. Web Crawler and Search Engine Indexing: When a web crawler collects web pages, the
MapReduce model can be used to process the crawled data and build an index for a search
engine. The map phase can extract the text and metadata from each web page, while the
reduce phase can aggregate the information and build the search index.
2. Log File Analysis: Large organizations often have massive amounts of log data generated
by their systems and applications. MapReduce can be used to analyze these logs and
extract valuable insights, such as identifying errors, tracking user behavior, or generating
usage reports.
3. Sentiment Analysis: MapReduce can be used to analyze social media data, such as
tweets or Facebook posts, to determine the sentiment (positive, negative, or neutral) of the
content. The map phase can extract the text and metadata, while the reduce phase can
aggregate the sentiment scores for each entity or topic.
4. Recommendation Systems: E-commerce websites and streaming platforms use
recommendation systems to suggest products or content to users based on their browsing
history and preferences. MapReduce can be used to process large amounts of user data,
create user profiles, and generate personalized recommendations.

5. Financial Data Analysis: Banks and financial institutions often need to analyze large
datasets, such as stock trades, transactions, or market data. MapReduce can be used to
process this data, detect patterns, and identify trends that can inform investment decisions
or risk management strategies.

The key advantage of MapReduce is its ability to scale and process large datasets efficiently by distributing the
workload across multiple machines. This makes it well-suited for handling massive amounts of data in a variety
of real-time applications.
YARN
What is YARN?

YARN (Yet Another Resource Negotiator) is a core component of Hadoop introduced in Hadoop 2.x to
manage resources in a distributed computing environment. It functions as a resource management layer
for the Hadoop cluster, allowing multiple applications to share resources while improving the scalability
and efficiency of data processing.

Need for YARN:

YARN was introduced to overcome the limitations of Hadoop’s original MapReduce framework, which
tightly coupled resource management and job execution. Some key needs for YARN are:

1. Scalability: As data sizes grew, there was a need for a more efficient resource management
system.
2. Resource Utilization: The earlier versions of Hadoop lacked fine-grained control over resources,
leading to underutilization.
3. Multi-tenancy: YARN enables multiple data-processing frameworks (not just MapReduce) to run
on a single cluster, making it more flexible.
4. Decoupling of Resource Management and Scheduling: YARN separates resource
management and job scheduling, allowing better control and scalability.

Advantages of YARN:

1. Scalability: YARN allows Hadoop to scale to larger clusters by efficiently managing resources
across different nodes.
2. Resource Efficiency: It optimizes the use of cluster resources by dynamically allocating
resources to tasks based on the needs of applications.
3. Flexibility: YARN supports various data-processing frameworks (MapReduce, Spark, Flink),
making Hadoop more versatile.
4. Multi-tenancy: Multiple applications can share the cluster resources without interfering with each
other.
5. Improved Performance: By decoupling resource management from job scheduling, YARN
provides a more efficient way to manage workloads.
6. Fault Tolerance: YARN improves fault tolerance by isolating resource management and
application execution, so failures in one don’t affect the other.
Components of YARN Architecture:

1. Resource Manager (RM):


○ Central authority that manages and allocates resources across the entire Hadoop
cluster.
○ It has two key components:
■ Scheduler:
■ Allocates resources to various running applications based on
resource requirements.
■ Ensures fairness and capacity but does not monitor or track job
statuses.
■ Applications Manager:
■ Accepts application submissions (such as MapReduce or
Spark jobs).
■ Negotiates the first container for each application's
Application Master and restarts that container if it fails.
2. Node Manager (NM):
○ Per-node agent responsible for managing individual nodes within the cluster.
○ Manages the lifecycle of containers (which are resource bundles like CPU and
memory) on its node.
○ Monitors resource usage (CPU, memory, disk) and reports it to the Resource
Manager.
○ Responsible for launching containers and killing them when tasks are done or fail.
3. Application Master (AM):
○ Per-application agent that negotiates resources with the Resource Manager.
○ Manages the execution of tasks within the application by requesting containers
from the Resource Manager.
○ Each application (e.g., a MapReduce or Spark job) has its own dedicated
Application Master.
○ Coordinates between the Node Managers to execute tasks and monitors the
application's progress.
○ It is short-lived, running only for the duration of the application it is managing.
4. Containers:
○ Execution units that are allocated by the Resource Manager and managed by the
Node Manager.
○ Containers are bundles of resources (CPU, memory, disk, network) and are where
individual tasks run.
○ Tasks such as MapReduce jobs or Spark operations are executed within
containers.
YARN Workflow:
1. Application Submission:
○ A client submits an application (e.g., a MapReduce job) to the Resource Manager.
2. Resource Allocation for Application Master:
○ The Resource Manager allocates the first container for the Application Master of that
job.
3. Application Master Negotiation:
○ The Application Master negotiates with the Resource Manager for additional containers
where the job tasks will be executed.
4. Task Execution:
○ The Node Manager on each node manages containers and runs the tasks inside these
containers, following the instructions from the Application Master.
5. Job Monitoring and Completion:
○ The Application Master monitors the execution of tasks, handles retries or failures, and
reports back to the Resource Manager once the application completes.
6. Resource Release:
○ Once the job is finished, the Application Master notifies the Resource Manager, and all
allocated containers and resources are released back to the pool for other jobs.
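The workflow above can also be observed from a client program. The hedged sketch below uses the
YarnClient API to ask the Resource Manager for its running Node Managers and applications; it
assumes a reachable cluster whose settings are available on the classpath via yarn-site.xml.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfoSketch {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager using settings from yarn-site.xml
        // on the classpath (cluster address is an environment assumption).
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The Resource Manager knows every Node Manager in the cluster ...
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }

        // ... and every application, each with its own Application Master.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```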
