
UNIT – IV

Hadoop Ecosystem:


The Hadoop Ecosystem is a platform, or a suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All of these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
• Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling



Note: Apart from the components mentioned above, there are many other components that are part of the Hadoop ecosystem.
All of these toolkits or components revolve around one thing: data. That is the beauty of Hadoop: because everything revolves around data, its processing and analysis become easier.
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large data
sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the
form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• The NameNode is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the DataNodes, which store the actual data. The DataNodes run on commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the
system.
YARN:
• Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for the Hadoop System.
• It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Master
• The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers handle the resources (CPU, memory, bandwidth) on each machine and report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
Hadoop 2.0 New Features - NameNode High Availability:
High availability was a new feature added in Hadoop 2.x to solve the single-point-of-failure problem of older versions of Hadoop. HDFS follows a master-slave architecture in which the NameNode is the master node and maintains the filesystem tree, so HDFS cannot be used without the NameNode. The NameNode therefore becomes a bottleneck; the HDFS high availability feature addresses this issue.



• Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that NameNode failed, the cluster as a whole would be out of service. The cluster remained unavailable until the NameNode was restarted or brought up on a separate machine.
• Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The HDFS NameNode High Availability architecture provides the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
1. Active NameNode – It handles all client operations in the cluster.
2. Passive (Standby) NameNode – It is a standby NameNode that holds the same metadata as the active NameNode. It acts as a slave and maintains enough state to provide a fast failover, if necessary.
• If the active NameNode fails, the passive NameNode takes over all the responsibilities of the active node and the cluster continues to work.
• Issues in maintaining consistency in the HDFS High Availability cluster are as follows:
✓ The active and standby NameNodes should always be in sync with each other, i.e. they should have the same metadata. This permits restoring the Hadoop cluster to the namespace state at which it crashed, which in turn provides fast failover.
✓ There should be only one active NameNode at a time. Otherwise, two active NameNodes will lead to corruption of the data. This scenario is called a "split-brain scenario", in which the cluster gets divided into smaller clusters, each believing that it is the only active cluster. "Fencing" avoids such scenarios; fencing is the process of ensuring that only one NameNode remains active at a particular time.



HDFS Federation:
HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and
storage, enabling a generic block storage layer. It enables support for multiple namespaces in the cluster to
improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS
clusters to new implementations and use cases.
To scale the name service horizontally, the federation uses multiple independent namenodes/namespaces. The
namenodes are federated, that is, the namenodes are independent and don’t require coordination with each other.
The datanodes are used as common storage for blocks by all the namenodes. Each datanode registers with all the
namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the
namenodes.
A block pool is the set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. Each block pool is managed independently of the others, which allows a namespace to generate block IDs for new blocks without coordinating with the other namespaces. The failure of one namenode does not prevent the datanodes from serving the other namenodes in the cluster.
A Namespace and its block pool together are called Namespace Volume. It is a self-contained unit of
management. When a namenode/namespace is deleted, the corresponding block pool at the datanodes is deleted.
Each namespace volume is upgraded as a unit, during cluster upgrade.



MRv2:
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. For each job, one slave node acts as the Application Master, monitoring resources, tasks, etc. The MapReduce framework in the Hadoop 1.x version is also known as MRv1. The MRv1 framework includes client communication, job execution and management, resource scheduling, and resource management. The Hadoop daemons associated with MRv1 are the JobTracker and the TaskTracker.
YARN:
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck on the JobTracker that was present in Hadoop 1.0. YARN was described as a "redesigned resource manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system used for big data processing. YARN also allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run on and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient. Through its various components, it can dynamically allocate resources and schedule application processing. For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.
Running MRv1 in YARN:
• YARN uses the ResourceManager web interface for monitoring applications running on a YARN cluster. The
ResourceManager UI shows the basic cluster metrics, list of applications, and nodes associated with the
cluster. In this section, we'll discuss the monitoring of MRv1 applications over YARN.
• The Resource Manager is the core component of YARN – Yet Another Resource Negotiator. In analogy, it
occupies the place of JobTracker of MRV1. Hadoop YARN is designed to provide a generic and flexible
framework to administer the computing resources in the Hadoop cluster.
• In this direction, the YARN ResourceManager service (RM) is the central controlling authority for resource management and makes allocation decisions. The ResourceManager has two main components: the Scheduler and the ApplicationsManager.



NoSQL Databases:
Introduction to NoSQL:
NoSQL is a term for any type of database that does not use SQL as the primary means of retrieving data. NoSQL databases provide limited traditional (relational) functionality and are designed for scalability and high-performance retrieval and append operations. Typically, NoSQL databases store data as key-value pairs, which works well for data that is unrelated.
NoSQL databases can store relationship data—they just store it differently than relational databases do. When
compared with SQL databases, many find modelling relationship data in NoSQL databases to be easier than in
SQL databases, because related data doesn’t have to be split between tables. NoSQL data models allow related
data to be nested within a single data structure.
• Here are the four main types of NoSQL databases:
1.Document databases:
A document database stores data in JSON, BSON, or XML documents (not Word documents or Google docs,
of course). In a document database, documents can be nested. Particular elements can be indexed for faster
querying.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications,
which means less translation is required to use the data in an application. SQL data must often be assembled and
disassembled when moving back and forth between applications and storage.
2.Key-value stores:
Key-value databases are a simpler type of database where each item contains keys and values. A value can
typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically
simple. Key-value databases are great for use cases where you need to store large amounts of data but you don’t
need to perform complex queries to retrieve it. Common use cases include storing user preferences or caching.
Redis and DynamoDB are popular key-value databases.
3.Column-oriented databases:
Wide-column stores store data in tables, rows, and dynamic columns. Wide-column stores provide a lot of
flexibility over relational databases because each row is not required to have the same columns. Many consider
wide-column stores to be two-dimensional key-value databases. Wide-column stores are great for when you need
to store large amounts of data and you can predict what your query patterns will be. Wide-column stores are
commonly used for storing Internet of Things data and user profile data. Cassandra and HBase are two of the
most popular wide-column stores.
4.Graph databases:
Graph databases store data in nodes and edges. Nodes typically store information about people, places, and
things while edges store information about the relationships between the nodes. Graph databases excel in use
cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and
recommendation engines. Neo4j and JanusGraph are examples of graph databases.
MongoDB:
Introduction:
MongoDB is an open-source document database and a leading NoSQL database. MongoDB is written in C++. This section gives an overview of the MongoDB concepts needed to create and deploy a highly scalable and performance-oriented database.
Data Types:
MongoDB supports many data types. Some of them are −
• String − This is the most commonly used datatype to store the data. String in MongoDB must be UTF-8
valid.
• Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit depending upon your
server.
• Boolean − This type is used to store a boolean (true/ false) value.
• Double − This type is used to store floating-point values.
• Min/ Max keys − This type is used to compare a value against the lowest and highest BSON elements.



• Arrays − This type is used to store arrays or lists or multiple values into one key.
• Timestamp − This type is used to store a timestamp. It can be handy for recording when a document has been modified or added.
• Object − This data type is used for embedded documents.
• Null − This type is used to store a Null value.
• Symbol − This datatype is used identically to a string; however, it's generally reserved for languages that
use a specific symbol type.
• Date − This data type is used to store the current date or time in UNIX time format. You can specify your own date and time by creating an object of Date and passing the day, month, and year into it.
• Object ID − This data type is used to store the document’s ID.
• Binary data − This data type is used to store binary data.
• Code − This data type is used to store JavaScript code into the document.
• Regular expression − This data type is used to store regular expressions.
Creating Document:
• Insert a Single Document
db.collection.insertOne() inserts a single document into a collection.
• Insert Multiple Documents
db.collection.insertMany() can insert multiple documents into a collection. Pass an array of documents to the
method.
Updating Documents:
db.collection.updateOne(<filter>, <update>, <options>)
➢ Updates at most a single document that matches a specified filter even though multiple documents may match
the specified filter.
db.collection.updateMany(<filter>, <update>, <options>)
➢ Update all documents that match a specified filter.
db.collection.replaceOne(<filter>, <update>, <options>)
➢ Replaces at most a single document that matches a specified filter even though multiple documents may match
the specified filter.
db.collection.update()
➢ Either updates or replaces a single document that matches a specified filter or updates all documents that
match a specified filter.
• By default, the db.collection.update() method updates a single document. To update multiple documents, use
the multi option.



Deleting Documents:
db.collection.deleteMany()
➢ Delete all documents that match a specified filter.
db.collection.deleteOne()
➢ Delete at most a single document that matches a specified filter even though multiple documents may match
the specified filter.
db.collection.remove()
➢ Delete a single document or all documents that match a specified filter.
db.collection.findOneAndDelete()
➢ findOneAndDelete() provides a sort option. The option allows for the deletion of the first document sorted by
the specified order.
db.collection.findAndModify()
➢ db.collection.findAndModify() provides a sort option. The option allows for the deletion of the first document
sorted by the specified order.
db.collection.bulkWrite()
➢ Performs multiple write operations (including deletes) in a single bulk call.
Querying:
find() Method
To query data from MongoDB collection, you need to use MongoDB's find() method.
Syntax
The basic syntax of find() method is as follows −
>db.COLLECTION_NAME.find()
• find() method will display all the documents in a non-structured way.
pretty() Method
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
findOne() method
• Apart from the find() method, there is the findOne() method, which returns only one document.
Syntax
>db.COLLECTION_NAME.findOne()



AND in MongoDB
Syntax
• To query documents based on the AND condition, you need to use $and keyword. Following is the basic
syntax of AND −
>db.mycol.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })
OR in MongoDB
Syntax
• To query documents based on the OR condition, you need to use $or keyword. Following is the basic syntax
of OR −
>db.mycol.find(
{
$or: [
{key1: value1}, {key2:value2}
]
}
).pretty()
NOR in MongoDB
Syntax
• To query documents based on the NOR condition, you need to use the $nor keyword. Following is the basic syntax of NOR −
>db.COLLECTION_NAME.find(
{
$nor: [
{key1: value1}, {key2: value2}
]
}
)
NOT in MongoDB
Syntax
• To query documents based on the NOT condition, you need to use the $not keyword. $not negates an operator expression on a single field. Following is the basic syntax of NOT −
>db.COLLECTION_NAME.find(
{
key1: { $not: { $gt: value1 } }
}
).pretty()

Introduction to indexing
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a
collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must
inspect.
Indexes are special data structures that store a small portion of the collection's data set in an easy to traverse
form. The index stores the value of a specific field or set of fields, ordered by the value of the field. The ordering
of the index entries supports efficient equality matches and range-based query operations. In addition, MongoDB
can return sorted results by using the ordering in the index.
MongoDB supports indexes:
• At the collection level
• Similar to indexes in an RDBMS
Indexes can be used for:
– More efficient filtering
– More efficient sorting
– Index-only queries (covering indexes)
Types of Indexes
→ Default _id Index
• MongoDB creates a unique index on the _id field during the creation of a collection.
• The _id index prevents clients from inserting two documents with the same value for the _id field.
• You cannot drop this index on the _id field.
→ Create an Index
• To create an index, use
db.collection.createIndex()
db.collection.createIndex( <key and index type specification>, <options> )



The db.collection.createIndex() method only creates an index if an index of the same specification does not
already exist.
• MongoDB indexes use a B-tree data structure.
• MongoDB provides several different index types to support specific types of data and queries.
→ Single Field
• In addition to the MongoDB-defined _id index, MongoDB supports the creation of user-defined
ascending/descending indexes on a single field of a document.
The following example creates an ascending index on the field orderDate.
db.collection.createIndex( { orderDate: 1 } )
→ Compound Index
• MongoDB also supports user-defined indexes on multiple fields, i.e. compound indexes.
• The order of fields listed in a compound index has significance. For instance, if a compound index consists
of { userid: 1, score: -1 }, the index sorts first by userid and then, within each userid value, sorts by score.
• The following example creates a compound index on the orderDate field (in ascending order) and the zipcode field (in descending order).
db.collection.createIndex( { orderDate: 1, zipcode: -1 } )
→ Multikey Index
• MongoDB uses multikey indexes to index the content stored in arrays.
• If you index a field that holds an array value, MongoDB creates separate index entries for every element
of the array.
• These multikey indexes allow queries to select documents that contain arrays by matching on the element or elements of the arrays.
• MongoDB automatically determines whether to create a multikey index if the indexed field contains an
array value; you do not need to explicitly specify the multikey type.
→ Index Use
• Indexes can improve the efficiency of read operations.
• Covered queries: when the query criteria and the projection of a query include only the indexed fields, MongoDB will return results directly from the index without scanning any documents or bringing documents into memory. These covered queries can be very efficient.
• Index intersection (new in version 2.6): MongoDB can use the intersection of indexes to fulfil queries.



• For queries that specify compound query conditions, if one index can fulfil a part of the query condition and another index can fulfil another part of the query condition, then MongoDB can use the intersection of the two indexes to fulfil the query.
• To illustrate index intersection, consider a collection orders that has the following indexes:
{ qty: 1 }
{ item: 1 }
• MongoDB can use the intersection of the two indexes to support the following query:
db.orders.find( { item: "abc123", qty: { $gt: 15} } )
→ Remove Indexes
• You can use the following methods to remove indexes:
db.collection.dropIndex() method
db.accounts.dropIndex( { "tax-id": 1 } )
The above operation removes the ascending index on the tax-id field in the accounts collection.
db.collection.dropIndexes() method
To remove all indexes barring the _id index from a collection, use this method.
→ Modify Indexes
• To modify an index, first, drop the index and then recreate it.
• Drop Index: Execute the query given below to return a document showing the operation status.
db.orders.dropIndex({ "cust_id" : 1, "ord_date" :-1, "items" : 1 })
Recreate the Index: Execute the query given below to return a document showing the status of the results.
db.orders.createIndex({ "cust_id" : 1, "ord_date" : -1, "items" : -1 })
→ Rebuild Indexes
• In addition to modifying indexes, you can also rebuild them.
• To rebuild all indexes of a collection, use the db.collection.reIndex() method.
• This will drop all indexes including _id and rebuild all indexes in a single operation.
Capped Collections:
Capped collections are fixed-size collections that support high-throughput operations that insert and retrieve
documents based on insertion order. Capped collections work in a way similar to circular buffers: once a collection
fills its allocated space, it makes room for new documents by overwriting the oldest documents in the collection.



Procedures
→ Create a Capped Collection
You must create capped collections explicitly using the db.createCollection() method, which is a helper in the
mongo shell for the create command. When creating a capped collection you must specify the maximum size of
the collection in bytes, which MongoDB will pre-allocate for the collection. The size of the capped collection
includes a small amount of space for internal overhead.
→ Query a Capped Collection
If you perform a find() on a capped collection with no ordering specified, MongoDB guarantees that the
ordering of results is the same as the insertion order.
• To retrieve documents in reverse insertion order, issue find() along with the sort() method with the $natural
parameter set to -1, as shown in the following example:
db.cappedCollection.find().sort( { $natural: -1 } )
→ Check if a Collection is Capped
Use the isCapped() method to determine if a collection is capped, as follows:
db.collection.isCapped()
→ Convert a Collection to Capped
You can convert a non-capped collection to a capped collection with the convertToCapped command:
db.runCommand({"convertToCapped": "mycoll", size: 100000});
• The size parameter specifies the size of the capped collection in bytes.
Spark:
Installing spark
• Choose a Spark release: e.g. 3.1.2 (Jun 01 2021) or 3.0.3 (Jun 23 2021).
• Choose a package type: Pre-built for Apache Hadoop 3.2 and later; Pre-built for Apache Hadoop 2.7; Pre-built with user-provided Apache Hadoop; or Source Code.
• Download Spark: spark-3.1.2-bin-hadoop3.2.tgz
• Verify the release using the 3.1.2 signatures, checksums, and project release KEYS.
• Note that Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.
Spark Applications:
1. Processing Streaming Data
The most wonderful aspect of Apache Spark is its ability to process streaming data. Every second, an
unprecedented amount of data is generated globally. This pushes companies and businesses to process data in large bulks and analyze it in real time. The Spark Streaming feature can efficiently handle this function. By
unifying disparate data processing capabilities, Spark Streaming allows developers to use a single framework to
accommodate all their processing requirements. Some of the best features of Spark Streaming are:
Streaming ETL – Spark's Streaming ETL continually cleans and aggregates the data before pushing it into data repositories. This is unlike the complicated process of conventional ETL (extract, transform, load) tools used for batch processing in data warehouse environments, which first read the data, then convert it to a database-compatible format, and finally write it to the target database.
Data enrichment – This feature helps to enrich the quality of data by combining it with static data, thus,
promoting real-time data analysis. Online marketers use data enrichment capabilities to combine historical
customer data with live customer behaviour data for delivering personalized and targeted ads to customers in real-
time.
Trigger event detection – The trigger event detection feature allows you to promptly detect and respond to
unusual behaviours or “trigger events” that could compromise the system or create a serious problem within it.
While financial institutions leverage this capability to detect fraudulent transactions, healthcare providers use it
to identify potentially dangerous health changes in the vital signs of a patient and automatically send alerts to the
caregivers so that they can take the appropriate actions.
Complex session analysis – Spark Streaming allows you to group live sessions and events (for example, user
activity after logging into a website/application) together and also analyze them. Moreover, this information can
be used to update ML models continually. Netflix uses this feature to obtain real-time customer behaviour insights
on the platform and to create more targeted show recommendations for the users.
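As a rough illustration of the streaming model described above, the following is a minimal Spark Streaming word count over a TCP socket source, written against the standard DStream API; the host, port, and 5-second batch interval are arbitrary choices for this sketch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches (assumed)

    // DStream of lines read from a TCP socket (hypothetical host and port)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()           // print each batch's counts
    ssc.start()              // start receiving and processing
    ssc.awaitTermination()   // run until stopped
  }
}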
2. Machine Learning
Spark has commendable Machine Learning abilities. It is equipped with an integrated framework for performing
advanced analytics that allows you to run repeated queries on datasets. This, in essence, is the processing of
Machine learning algorithms. Machine Learning Library (MLlib) is one of Spark’s most potent ML components.
This library can perform clustering, classification, dimensionality reduction, and much more. With MLlib, Spark
can be used for many Big Data functions such as sentiment analysis, predictive intelligence, customer
segmentation, and recommendation engines, among other things.
Another noteworthy application of Spark is network security. By leveraging the diverse components of the Spark stack, security providers/companies can inspect data packets in real time to detect any traces of malicious activity. Spark Streaming enables them to check for known threats before passing the packets
on to the repository. When the packets arrive in the repository, they are further analyzed by other Spark
components (for instance, MLlib). In this way, Spark helps security providers to identify and detect threats as
they emerge, thereby enabling them to solidify client security.
3. Fog Computing
Fog Computing decentralizes data processing and storage. However, certain complexities accompany Fog
Computing – it requires low latency, massively parallel processing of ML, and incredibly complex graph analytics
algorithms. Thanks to vital stack components like Spark Streaming, MLlib, and GraphX (a graph analysis engine),
Spark performs excellently as a capable Fog Computing solution.
Spark jobs, stages and tasks:
Job - A parallel computation consisting of multiple tasks that get spawned in response to a Spark action (e.g.,
save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into
one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where
each node within a DAG could be single or multiple Spark stages.
Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the
DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark
operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated
on the operator’s computation boundaries, where they dictate data transfer among Spark executors.
Task - A single unit of work or execution that will be sent to a Spark executor. Each stage is comprised of Spark
tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core
and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on
16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel.
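As a small sketch of how these terms relate (names, partition counts, and the local master setting are illustrative), each action below triggers a job, the shuffle required by reduceByKey splits that job into two stages, and each stage runs one task per partition:

import org.apache.spark.{SparkConf, SparkContext}

object JobsStagesTasks {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("JobsStagesTasks"))

    // 8 partitions => up to 8 tasks per stage
    val words = sc.parallelize(Seq("spark", "yarn", "hdfs", "spark", "hive", "spark"), numSlices = 8)

    // Narrow transformation: stays within the same stage
    val pairs = words.map(w => (w, 1))

    // Wide transformation: reduceByKey requires a shuffle, so it starts a new stage
    val counts = pairs.reduceByKey(_ + _)

    // Actions trigger jobs; this application therefore runs two jobs
    println(counts.collect().mkString(", "))   // job 1 (two stages)
    println(counts.count())                    // job 2
    sc.stop()
  }
}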

Resilient Distributed Datasets (RDDs):


• Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed
collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on
different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-
defined classes.
• Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic
operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that
can be operated on in parallel.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program or referencing
a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering
a Hadoop Input Format.
• Spark makes use of the concept of RDDs to achieve faster and more efficient operations than classic MapReduce, largely because intermediate results can be kept in memory rather than written back to disk between steps.
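A brief sketch of the two creation routes just mentioned, using the standard SparkContext API (the HDFS path below is only a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("RDDBasics"))

// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(numbers.map(_ * 2).reduce(_ + _))   // => 30

// 2. Reference a dataset in an external storage system (placeholder path)
val lines = sc.textFile("hdfs:///data/sample.txt")
println(lines.count())

sc.stop()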
Anatomy of a Spark job run:
• A Spark application contains several components, all of which exist whether you are running Spark on a single machine or across a cluster of hundreds or thousands of nodes.
• The components of a Spark application are the Driver, the Master, the Cluster Manager, and the Executors.
• All of the Spark components, including the driver, master, and executor processes, run in Java virtual machines (JVMs). A JVM is a cross-platform runtime engine that executes instructions compiled into Java bytecode. Scala, in which Spark is written, compiles into bytecode and runs on the JVM.
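To make the driver's role concrete, the following is a minimal driver-program sketch using the SparkSession API (illustrative only; on a cluster the master setting would normally be supplied by spark-submit rather than hard-coded):

import org.apache.spark.sql.SparkSession

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // The driver process starts here and negotiates with the cluster manager
    val spark = SparkSession.builder()
      .appName("MinimalDriver")
      .master("local[*]")   // on YARN this is set by spark-submit instead
      .getOrCreate()

    // Work defined in the driver is split into tasks that run on the executors
    val squares = spark.sparkContext.parallelize(1 to 10).map(n => n * n)
    println(squares.sum())

    spark.stop()
  }
}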



Spark on YARN:
• Apache Spark is an in-memory distributed data processing engine and YARN is a cluster management
technology.
• When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce
schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same
container. This approach enables several orders of magnitude faster task startup time.
• Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly,
yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and
debugging uses where you want to see your application’s output immediately.
• Understanding the difference requires an understanding of YARN’s Application Master concept. In YARN,
each application instance has an Application Master process, which is the first container started for that
application. The Application Master is responsible for requesting resources from the ResourceManager and, when they are allocated, telling NodeManagers to start containers on its behalf. Application Masters obviate the need
for an active client — the process starting the application can go away and coordination continues from a
process managed by YARN running on the cluster.
• In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is
responsible for both driving the application and requesting resources from YARN, and this process runs
inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.



YARN cluster mode
• The yarn-cluster mode, however, is not well suited to using Spark interactively. Spark applications that
require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that
initiates the Spark application. In yarn-client mode, the Application Master is merely present to request
executor containers from YARN. The client communicates with those containers to schedule work after
they start:



Yarn Client Mode

Different Deployment Modes across the cluster


In yarn-cluster mode, the Spark client submits the Spark application to YARN, and both the Spark driver and the Spark executors run under the supervision of YARN. In yarn-client mode, only the Spark executors are under the supervision of YARN; the YARN ApplicationMaster requests resources just for the Spark executors, while the driver program runs in the client process, outside of YARN's control.
SCALA:
Introduction
Scala is a modern multi-paradigm programming language designed to express common programming patterns
in a concise, elegant, and type-safe way. It seamlessly integrates features of object-oriented and functional
languages.
Classes and Objects
A class is a blueprint for objects. Once you define a class, you can create objects from the class blueprint with
the keyword new. Through the object, you can use all functionalities of the defined class.
Class
Syntax
class Point(xc: Int, yc: Int) {
  var x: Int = xc
  var y: Int = yc

  def move(dx: Int, dy: Int) {
    x = x + dx
    y = y + dy
    println("Point x location : " + x)
    println("Point y location : " + y)
  }
}
• Following is a simple syntax to define a basic class in Scala. This class defines two variables, x and y, and a method, move, which does not return a value. Class variables are called fields of the class, and methods are called class methods.
• The class name works as a class constructor which can take several parameters. The above code defines two
constructor arguments, xc and yc; they are both visible in the whole body of the class.
• You can create objects using the keyword new, and then you can access the class fields and methods.
Extending a Class
• You can extend a base Scala class and you can design an inherited class in the same way you do it in Java
(use extends keyword), but there are two restrictions: method overriding requires the override keyword, and
only the primary constructor can pass parameters to the base constructor.
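A short sketch of the two restrictions just mentioned (the class names are illustrative):

class Person(val name: String) {
  def greet(): String = "Hello, I am " + name
}

// Only the primary constructor can pass parameters to the base constructor
class Employee(name: String, val company: String) extends Person(name) {
  // Method overriding requires the override keyword
  override def greet(): String = super.greet() + " and I work at " + company
}

val e = new Employee("Asha", "SDGI")
println(e.greet())   // => Hello, I am Asha and I work at SDGI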
Implicit Classes
Implicit classes allow implicit conversions via the class's primary constructor when the class is in scope. An implicit class is a class marked with the 'implicit' keyword. This feature was introduced in Scala 2.10.
Syntax
object <object name> {
  implicit class <class name>(<variable>: <data type>) {
    def <method>(): Unit = ...
  }
}
The above is the general form of an implicit class. The implicit class is always defined inside an object scope, where all method definitions are allowed, because an implicit class cannot be a top-level class.
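A minimal working example of an implicit class (the object and method names are illustrative):

object IntUtils {
  // Because the class is implicit and in scope, an Int gains the extra method
  implicit class RichCounter(n: Int) {
    def times(action: => Unit): Unit = {
      var i = 0
      while (i < n) { action; i += 1 }
    }
  }
}

import IntUtils._
3.times(println("Hello"))   // prints "Hello" three times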
Singleton Objects
Scala is more object-oriented than Java because, in Scala, we cannot have static members. Instead, Scala has
singleton objects. A singleton is a class that can have only one instance, i.e., Object. You create a singleton
using the keyword object instead of the class keyword. Since you can't instantiate a singleton object, you can't
pass parameters to the primary constructor.
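A small sketch of a singleton object used where Java would use static members (names are illustrative):

object MathUtil {
  val Pi = 3.14159                         // shared value, like a static field
  def square(x: Double): Double = x * x    // like a static method
}

// No new keyword is needed (or allowed); the single instance is accessed by name
println(MathUtil.square(4.0))   // => 16.0
println(MathUtil.Pi)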

Basic Types and Operators


Basic Data Types:



Operators:
An operator is a symbol that tells the compiler to perform specific mathematical or logical manipulations. Scala
is rich in built-in operators and provides the following types of operators −
1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Bitwise Operators
5. Assignment Operators
→ Arithmetic Operators
The following arithmetic operators are supported by the Scala language.

→ Relational Operators
The following relational operators are supported by the Scala language



→ Logical Operators
The following logical operators are supported by the Scala language.

→ Bitwise Operators
Bitwise operator works on bits and performs bit by bit operation. The truth tables for &, |, and ^ are as follows-



→ Assignment Operators
The following assignment operators are supported by the Scala language.
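As a quick illustration of the operator categories listed above, the sketch below exercises a few operators from each (the values are arbitrary):

val a = 10
val b = 3

// Arithmetic: +, -, *, /, %
println((a + b, a - b, a * b, a / b, a % b))    // (13,7,30,3,1)

// Relational: ==, !=, >, <, >=, <=
println((a == b, a != b, a > b, a <= b))        // (false,true,true,false)

// Logical: &&, ||, !
val p = true; val q = false
println((p && q, p || q, !p))                   // (false,true,false)

// Bitwise: &, |, ^, <<, >>
println((a & b, a | b, a ^ b, a << 1, a >> 1))  // (2,11,9,20,5)

// Assignment: =, +=, -=, *=, /=, %=
var c = a
c += b          // c is now 13
println(c)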

Built-in control structures:


Scala has only a handful of built-in control structures. The only control structures are if, while, for, try, match,
and function calls. The reason Scala has so few is that it has included function literals since its inception. Instead
of accumulating one higher-level control structure after another in the base syntax, Scala accumulates them in
libraries.
1. If expressions
Scala's if works just like in many other languages. It tests a condition and then executes one of two code
branches depending on whether the condition holds true. Here is a common example, written in an imperative
style:
var filename = "default.txt"
if (!args.isEmpty)
filename = args(0)
This code declares a variable, filename, and initializes it to a default value. It then uses an if expression to check whether any arguments were supplied to the program. If so, it changes the variable to hold the value specified in the argument list. If no arguments were supplied, it leaves the variable set to the default value.
2. While loops
Scala's while loop behaves as in other languages. It has a condition and a body, and the body is executed over and
over as long as the condition holds true.
example:
def gcdLoop(x: Long, y: Long): Long = {
  var a = x
  var b = y
  while (a != 0) {
    val temp = a
    a = b % a
    b = temp
  }
  b
}
• Scala also has a do-while loop. This works like the while loop except that it tests the condition after the loop
body instead of before.
• Below shows a Scala script that uses a do-while to echo lines read from the standard input until an empty line
is entered:
var line = ""
do {
  line = scala.io.StdIn.readLine()
  println("Read: " + line)
} while (line != "")
3. For expressions
Scala's for expression is a Swiss army knife of iteration. It lets you combine a few simple ingredients in different
ways to express a wide variety of iterations. Simple uses enable common tasks such as iterating through a
sequence of integers. More advanced expressions can iterate over multiple collections of different kinds, can filter
out elements based on arbitrary conditions, and can produce new collections.
Iteration through collections
• The simplest thing you can do is to iterate through all the elements of a collection.



• For example, below shows some code that prints out all files in the current directory. The I/O is performed
using the Java API. First, we create a java.io.File on the current directory, ".", and call its listFiles method.
This method returns an array of File objects, one per directory and file contained in the current directory.
We store the resulting array in the filesHere variable
val filesHere = (new java.io.File(".")).listFiles
for (file <- filesHere)
println(file)
Filtering
Sometimes you do not want to iterate through a collection in its entirety. You want to filter it down to some
subset. You can do this with a for expression by adding a filter: an if clause inside the for's parentheses.
• For example, the code shown below lists only those files in the current directory whose names end with
".scala":
val filesHere = (new java.io.File(".")).listFiles

for (file <- filesHere if file.getName.endsWith(".scala"))


println(file)
Nested iteration
If you add multiple <- clauses, you will get nested "loops." For example, the for expression shown below has
two nested loops. The outer loop iterates through filesHere, and the inner loop iterates through fileLines(file) for
any file that ends with .scala.
def fileLines(file: java.io.File) =
  scala.io.Source.fromFile(file).getLines().toList

def grep(pattern: String) =
  for (
    file <- filesHere
    if file.getName.endsWith(".scala");
    line <- fileLines(file)
    if line.trim.matches(pattern)
  ) println(file + ": " + line.trim)

grep(".*gcd.*")
Mid-stream variable bindings
Note that the previous code repeats the expression line.trim. This is a non-trivial computation, so you might
want to only compute it once. You can do this by binding the result to a new variable using an equals sign (=).
The bound variable is introduced and used just like a val, only with the val keyword left out. Below shows an
example.
def grep(pattern: String) =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
    line <- fileLines(file)
    trimmed = line.trim
    if trimmed.matches(pattern)
  } println(file + ": " + trimmed)

grep(".*gcd.*")
Producing a new collection
While all of the examples so far have operated on the iterated values and then forgotten them, you can also
generate a value to remember for each iteration. To do so, you prefix the body of the for expression by the keyword
yield. For example, here is a function that identifies the .scala files and stores them in an array:
def scalaFiles =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
  } yield file
Each time the body of the for expression executes it produces one value, in this case simply file. When the for
expression completes, the result will include all of the yielded values contained in a single collection. The type
of the resulting collection is based on the kind of collections processed in the iteration clauses. In this case, the
result is an Array[File], because filesHere is an array and the type of the yielded expression is File.
4. Exception handling with try expressions
Scala's exceptions behave just like in many other languages. Instead of returning a value in the normal way, a
method can terminate by throwing an exception. The method's caller can either catch and handle that exception,
or it can itself simply terminate, in which case the exception propagates to the caller's caller. The exception
propagates in this way, unwinding the call stack until a method handles it or there are no more methods left.
Throwing exceptions
Throwing an exception looks the same as in Java. You create an exception object and then you throw it with
the throw keyword:
throw new IllegalArgumentException
Catching exceptions
You catch exceptions using the syntax shown below. The syntax for catch clauses was chosen for its consistency with an important and powerful part of Scala: pattern matching.
import java.io.FileReader
import java.io.FileNotFoundException
import java.io.IOException
try {
val f = new FileReader("input.txt") // Use and close file
} catch {
case ex: FileNotFoundException => // Handle missing file
case ex: IOException => // Handle other I/O error
}
The finally clause
You can wrap an expression with a finally clause if you want to cause some code to execute no matter how the
expression terminates. For example, you might want to be sure an open file gets closed even if a method exits by
throwing an exception. Below shown an example.
import java.io.FileReader
val file = new FileReader("input.txt")
try {
// Use the file
} finally {
file.close() // Be sure to close the file
}
Yielding a value
As with most other Scala control structures, try-catch-finally results in a value. For example, below shows how
you can try to parse a URL but use a default value if the URL is badly formed. The result is that of the try clause
if no exception is thrown, or the relevant catch clause if an exception is thrown and caught. If an exception is
thrown but not caught, the expression has no result at all. The value computed in the finally clause, if there is one,
is dropped. Usually, finally clauses do some kind of clean up such as closing a file; they should not normally
change the value computed in the main body or a catch clause of the try.
import java.net.URL
import java.net.MalformedURLException

def urlFor(path: String) =
  try {
    new URL(path)
  } catch {
    case e: MalformedURLException =>
      new URL("http://www.scala-lang.org")
  }
5. Match expressions
Scala's match expression lets you select from several alternatives, just like switch statements in other languages.
In general, a match expression lets you select using arbitrary patterns. For now, just consider using match to select
among several alternatives.
As an example, the script below reads a food name from the argument list and prints a companion to that food.
This match expression examines firstArg, which has been set to the first argument out of the argument list. If it
is the string "salt", it prints "pepper", while if it is the string "chips", it prints "salsa", and so on. The default case
is specified with an underscore (_), a wildcard symbol frequently used in Scala as a placeholder for a completely
unknown value.
val firstArg = if (args.length > 0) args(0) else ""

firstArg match {
  case "salt" => println("pepper")
  case "chips" => println("salsa")
  case "eggs" => println("bacon")
  case _ => println("huh?")
}

Functions
Scala has both functions and methods and we use the terms method and function interchangeably with a minor
difference. A Scala method is a part of a class that has a name, a signature, optionally some annotations, and some
bytecode whereas a function in Scala is a complete object which can be assigned to a variable. In other words, a
function, which is defined as a member of some object, is called a method.
Function Declarations
A Scala function declaration has the following form −
def functionName ([list of parameters]) : [return type]
• Methods are implicitly declared abstract if you don’t use the equals sign and the method body.
Function Definitions
• A Scala function definition has the following form −
Syntax
def functionName([list of parameters]): [return type] = {
  function body
  return [expr]
}
• Here, the return type could be any valid Scala data type and the list of parameters will be a list of variables
separated by a comma and the list of parameters and return type are optional.
Calling Functions
• Scala provides several syntactic variations for invoking methods. Following is the standard way to call a
method −
functionName( list of parameters )
• If a function is being called using an instance of the object, then we would use dot notation similar to Java as
follows −
[instance.]functionName( list of parameters )
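A brief sketch tying together the declaration, definition, and calling forms above (names are illustrative):

object FunctionDemo {
  // Definition: name, parameter list, return type, and body
  def addInt(a: Int, b: Int): Int = {
    a + b                    // the last expression is the return value
  }

  // A method that returns no meaningful value has return type Unit
  def printSum(a: Int, b: Int): Unit =
    println("Sum: " + addInt(a, b))

  def main(args: Array[String]): Unit = {
    val s = addInt(5, 7)     // standard call
    printSum(5, 7)           // prints "Sum: 12"
    println(s)               // prints 12
  }
}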
Closures
Scala closures are functions that use one or more free variables, and the return value of such a function depends on these variables. The free variables are defined outside of the closure function and are not included as parameters of the function. So, the difference between a closure and a normal function is the free variable.
A free variable is any variable that is not defined within the function and is not passed as a parameter of the function. The function itself does not provide a value for the free variable; the value is taken from the enclosing scope in which the function was defined.
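A minimal closure sketch (variable names are illustrative): multiplier uses the free variable factor, which is neither a parameter nor defined inside the function.

var factor = 3                            // free variable, defined outside the function

// multiplier closes over factor; its result depends on factor's current value
val multiplier = (i: Int) => i * factor

println(multiplier(10))   // => 30
factor = 5
println(multiplier(10))   // => 50, because the closure sees the updated free variable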
Inheritance:
Inheritance is an important pillar of OOP(Object Oriented Programming). It is the mechanism in Scala by
which one class is allowed to inherit the features(fields and methods) of another class.
Important terminology:
Super Class: The class whose features are inherited is known as superclass(or a base class or a parent class).
Sub Class: The class that inherits the other class is known as a subclass(or a derived class, extended class, or child
class). The subclass can add its own fields and methods in addition to the superclass fields and methods.
Reusability: Inheritance supports the concept of "reusability", i.e. when we want to create a new class and there is already a class that includes some of the code that we want, we can derive our new class from the existing class.
By doing this, we are reusing the fields and methods of the existing class.
The keyword used for inheritance is extends.
Syntax:
class child_class_name extends parent_class_name {
  // Methods and fields
}
➢ Type of inheritance
Below are the different types of inheritance which are supported by Scala:
Single Inheritance: In single inheritance, derived class inherits the features of one base class. In the image below,
class A serves as a base class for the derived class B.

Multilevel Inheritance: In multilevel inheritance, a derived class inherits from a base class, and that derived class in turn acts as the base class for another class. In the image below, class A serves as a base class for the derived class B, which in turn serves as a base class for the derived class C.



Hierarchical Inheritance: In Hierarchical Inheritance, one class serves as a superclass (base class) for more than
one subclass. In the below image, class A serves as a base class for the derived classes B, C, and D.

Multiple Inheritance: In Multiple inheritance, one class can have more than one superclass and inherit features
from all parent classes. Scala does not support multiple inheritance with classes, but it can be achieved by traits.



Hybrid Inheritance: It is a mix of two or more of the above types of inheritance. Since Scala doesn’t support
multiple inheritance with classes, hybrid inheritance is also not possible with classes. In Scala, we can achieve
hybrid inheritance only through traits.
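A short sketch of achieving multiple-inheritance-like behaviour with traits, as described above (trait and class names are illustrative):

trait Swimmer {
  def swim(): String = "swimming"
}

trait Flyer {
  def fly(): String = "flying"
}

// A class can extend at most one class, but it can mix in several traits
class Duck(val name: String) extends Swimmer with Flyer

val d = new Duck("Donald")
println(d.name + " is " + d.swim() + " and " + d.fly())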

*****
