Hadoop, Spark, MongoDB, and Scala Notes

The document outlines the Hadoop ecosystem, including components like HDFS, YARN, and various schedulers. It also introduces NoSQL databases, specifically MongoDB, detailing data types and operations. Additionally, it covers Apache Spark's architecture and Scala programming language features.

Hadoop Ecosystem and YARN

1. Hadoop Ecosystem Components:

- HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper, Ambari.

2. Schedulers:

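- FIFO: First In First Out; jobs run in the order they were submitted.

- Fair Scheduler: Resources are shared so that, over time, each running job gets a roughly equal share.

- Capacity Scheduler: The cluster is divided into queues, and each queue is guaranteed its configured share of capacity.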
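
To make the difference between the three policies concrete, here is a minimal sketch in Scala. It is a toy model, not YARN code: the names `Job`, `fifo`, `fair`, and `capacity` are hypothetical, a "container" is just a counter, and queue shares are assumed to be given as whole percentages.

```scala
object SchedulerSketch {
  // A job asks for `demand` containers; `queue` matters only for the capacity policy.
  final case class Job(name: String, demand: Int, queue: String)

  // FIFO: serve jobs in arrival order; each takes what it needs until nothing is left.
  def fifo(jobs: List[Job], total: Int): Map[String, Int] = {
    var free = total
    jobs.map { j =>
      val given = math.min(j.demand, free)
      free -= given
      j.name -> given
    }.toMap
  }

  // Fair: hand out containers one at a time, round-robin, to jobs that still want more.
  def fair(jobs: List[Job], total: Int): Map[String, Int] = {
    val alloc = scala.collection.mutable.Map.empty[String, Int]
    jobs.foreach(j => alloc(j.name) = 0)
    var free = total
    var progressed = true
    while (free > 0 && progressed) {
      progressed = false
      for (j <- jobs) {
        if (free > 0 && alloc(j.name) < j.demand) {
          alloc(j.name) += 1
          free -= 1
          progressed = true
        }
      }
    }
    alloc.toMap
  }

  // Capacity: split the cluster by queue percentage, then run FIFO inside each queue.
  def capacity(jobs: List[Job], total: Int, percent: Map[String, Int]): Map[String, Int] =
    jobs.groupBy(_.queue).flatMap { case (queue, queued) =>
      fifo(queued, total * percent.getOrElse(queue, 0) / 100)
    }
}
```

With two jobs that each want 8 of 10 containers, `fifo` gives the first job all 8 and the second only 2, `fair` gives each 5, and `capacity` with a 70/30 split between their queues gives 7 and 3.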

3. Hadoop 2.0 New Features:

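- NameNode High Availability: An active/standby NameNode pair removes the NameNode as a single point of failure.

- HDFS Federation: Supports multiple independent NameNodes, each managing part of the namespace.

- MRv2 (MapReduce Version 2): Splits the old JobTracker's duties, so cluster resource management is decoupled from per-job scheduling and monitoring.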

- YARN (Yet Another Resource Negotiator): Resource management layer.

- Running MRv1 in YARN: Backward compatibility with MRv1 applications.

NoSQL Databases

1. Introduction to NoSQL:

- Non-relational databases, typically distributed and with flexible (schema-less) data models.

- Types: Key-Value, Document, Column-Family, Graph databases.

MongoDB

1. Introduction:

- Document-oriented NoSQL database.

- Stores data in BSON (Binary JSON) format.

2. Data Types:

- String, Integer, Boolean, Double, Arrays, Objects, Null, Date, ObjectId.

3. Creating, Updating, and Deleting Documents:

- db.collection.insertOne(), insertMany()

- db.collection.updateOne(), updateMany()

- db.collection.deleteOne(), deleteMany()

4. Querying:

- db.collection.find(), findOne()

- Query operators: $gt, $lt, $in, $and, $or, $regex
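
As a rough illustration of what the query operators mean, the sketch below evaluates them against in-memory documents in plain Scala. It is a toy model rather than the MongoDB driver or server logic: `Doc`, `Cond`, `matches`, and `find` are hypothetical names, and only `$gt`, `$lt`, `$in`, `$regex`, `$and`, and `$or` are modelled.

```scala
object QuerySketch {
  // A document is modelled as a flat map from field name to value.
  type Doc = Map[String, Any]

  // One case class per operator listed above.
  sealed trait Cond
  final case class Gt(field: String, value: Double) extends Cond
  final case class Lt(field: String, value: Double) extends Cond
  final case class In(field: String, values: Set[Any]) extends Cond
  final case class RegexMatch(field: String, pattern: String) extends Cond
  final case class And(conds: List[Cond]) extends Cond
  final case class Or(conds: List[Cond]) extends Cond

  // Read a field as a number, if it is present and numeric.
  private def number(doc: Doc, field: String): Option[Double] =
    doc.get(field).collect {
      case i: Int    => i.toDouble
      case l: Long   => l.toDouble
      case d: Double => d
    }

  def matches(doc: Doc, cond: Cond): Boolean = cond match {
    case Gt(f, v) => number(doc, f).exists(_ > v)
    case Lt(f, v) => number(doc, f).exists(_ < v)
    case In(f, vs) => doc.get(f).exists(v => vs.contains(v))
    case RegexMatch(f, p) => doc.get(f).exists {
      case s: String => p.r.findFirstIn(s).isDefined // unanchored search
      case _         => false
    }
    case And(cs) => cs.forall(c => matches(doc, c))
    case Or(cs)  => cs.exists(c => matches(doc, c))
  }

  // Analogue of find(): keep every document that satisfies the condition.
  def find(docs: Seq[Doc], cond: Cond): Seq[Doc] = docs.filter(d => matches(d, cond))
}
```

For example, `And(List(Gt("age", 18), RegexMatch("name", "^A")))` plays the role of the shell filter `{ $and: [ { age: { $gt: 18 } }, { name: { $regex: "^A" } } ] }`.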

5. Introduction to Indexing:

- Improves query performance, at the cost of extra storage and slower writes.

- db.collection.createIndex({field: 1}) (1 for ascending order, -1 for descending)

6. Capped Collections:

- Fixed-size collections that keep documents in insertion order and overwrite the oldest documents once the size limit is reached.
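
The overwrite behaviour can be sketched as a bounded buffer. This is only an analogy under simplifying assumptions: a real capped collection is bounded by size in bytes (optionally also by document count), whereas the hypothetical `Capped` class below bounds by document count alone.

```scala
object CappedSketch {
  // Immutable fixed-capacity buffer: inserting past `maxDocs` drops the oldest entry.
  final case class Capped[A](maxDocs: Int, docs: Vector[A] = Vector.empty[A]) {
    def insert(doc: A): Capped[A] = {
      val appended = docs :+ doc                 // newest document goes to the end
      copy(docs = appended.takeRight(maxDocs))   // keep only the most recent maxDocs
    }
  }
}
```

Inserting 1, 2, 3, and 4 into a `Capped[Int](3)` leaves 2, 3, and 4, in insertion order.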

Apache Spark

1. Installing Spark:

- Download binaries, set environment variables, configure Spark.

2. Spark Applications, Jobs, Stages, and Tasks:

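- Application: User program, made up of a driver and its executors.

- Job: Unit of computation triggered by an action.

- Stage: Set of tasks that can run without a shuffle; a job is split into stages at shuffle boundaries.

- Task: Smallest unit of work; each task processes one partition.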
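
The idea of cutting a job into stages at shuffle boundaries can be sketched in plain Scala. This is a simplified model, not Spark's DAG scheduler: `Op` and `stages` are hypothetical names, and each operation is simply flagged as requiring a shuffle or not.

```scala
object StageSketch {
  // An operation in a job, flagged by whether it needs data to be shuffled first.
  final case class Op(name: String, shuffles: Boolean)

  // Start a new stage whenever an operation requires a shuffle;
  // otherwise append the operation to the current stage.
  def stages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                => List(List(op.name))
      case (acc, op) if op.shuffles => acc :+ List(op.name)
      case (acc, op)                => acc.init :+ (acc.last :+ op.name)
    }
}
```

A chain of `map`, `filter`, `reduceByKey`, `map`, where only `reduceByKey` shuffles, comes out as two stages: `[map, filter]` and `[reduceByKey, map]`.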

3. Resilient Distributed Datasets (RDDs):

- Immutable, distributed collection of objects.

- Supports transformations (lazy; they return a new RDD) and actions (they trigger computation and return a result).
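
The split between transformations and actions can be imitated with a lazy view from the Scala standard library. This is an analogy only, with no Spark dependency: `LazySketch`, `pipeline`, and `collect` are hypothetical names, whereas in real Spark `map` and `filter` are transformations on an RDD and calls such as `collect()` or `count()` are actions.

```scala
object LazySketch {
  var evaluated = 0 // counts how many elements the map function has touched

  val source: List[Int] = List(1, 2, 3, 4)

  // "Transformations": nothing runs yet, the view only records what to do.
  val pipeline = source.view
    .map { x => evaluated += 1; x * 2 }
    .filter(_ > 2)

  // "Action": forces the whole pipeline and returns a concrete result.
  def collect(): List[Int] = pipeline.toList
}
```

Until `collect()` is called, `evaluated` stays at 0, which mirrors how Spark defers work until an action is invoked.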

4. Anatomy of a Spark Job Run:

- Driver program creates a SparkContext.

- Transformations build up a lineage graph; an action submits it as a job.

- DAGScheduler splits the job into stages; TaskScheduler distributes each stage's tasks to executors.

5. Spark on YARN:

- Allows Spark to run on Hadoop YARN for resource management.

Scala

1. Introduction:

- Functional and object-oriented language.

- Runs on the JVM.


2. Classes and Objects:

- Class: Blueprint for objects.

- Object: Singleton instance, declared with the object keyword instead of class.
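
A minimal sketch of the distinction, using hypothetical names (`Counter`, `IdGenerator`) wrapped in an outer object so the snippet is self-contained:

```scala
object ClassesSketch {
  // A class is a blueprint: each `new Counter(...)` creates a separate instance.
  class Counter(start: Int) {
    private var value = start
    def increment(): Int = { value += 1; value }
  }

  // An `object` is a singleton: there is exactly one IdGenerator.
  object IdGenerator {
    private val counter = new Counter(0)
    def next(): Int = counter.increment()
  }
}
```

Two `Counter` instances keep independent state, while every caller of `IdGenerator.next()` shares the same underlying counter.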

3. Basic Types and Operators:

- Int, Float, Double, Char, Boolean.

- Operators: +, -, *, /, %, ==, !=, &&, ||
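
The following sketch shows the listed types and operators in use; the value names are arbitrary and chosen only for illustration.

```scala
object TypesSketch {
  val i: Int = 7
  val d: Double = 2.0
  val f: Float = 1.5f
  val c: Char = 'A'
  val b: Boolean = true

  val intDivision: Int = i / 2   // Int / Int truncates: 3
  val remainder: Int = i % 2     // 1
  val mixed: Double = i / d      // Int / Double widens to Double: 3.5
  val halfSum: Float = f + f     // 3.0f

  // && and || short-circuit; == on strings compares contents, not references.
  val logic: Boolean = (i != 7) || (b && c == 'A')
  val sameText: Boolean = ("sca" + "la") == "scala"
}
```

Note that `/` on two `Int` values performs integer division, which is a common source of surprises.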

4. Built-in Control Structures:

- if, else, while, for, match-case
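
Each of these control structures is shown once in the sketch below; the function names are hypothetical.

```scala
object ControlSketch {
  // `if`/`else` is an expression and yields a value.
  def sign(n: Int): String =
    if (n < 0) "negative" else if (n == 0) "zero" else "positive"

  // `while` loop with mutable local state.
  def sumTo(n: Int): Int = {
    var i = 1
    var total = 0
    while (i <= n) { total += i; i += 1 }
    total
  }

  // `for` with a guard and `yield` builds a new collection.
  def evenSquares(n: Int): List[Int] =
    (for (x <- 1 to n if x % 2 == 0) yield x * x).toList

  // `match` with a literal case, a typed case with a guard, and a default.
  def describe(x: Any): String = x match {
    case 0               => "zero"
    case n: Int if n > 0 => "positive int"
    case s: String       => "string of length " + s.length
    case _               => "something else"
  }
}
```

In Scala, `if` and `match` return values, so they can appear directly on the right-hand side of a definition.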

5. Functions and Closures:

- def functionName(parameters): returnType = {...}

- Closures: Functions that capture free variables (variables defined outside the function) from the enclosing scope.
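
A short sketch of both ideas, with hypothetical names (`add`, `multiplier`, `withBonus`):

```scala
object ClosureSketch {
  // Named function in the `def name(params): ReturnType = { ... }` form.
  def add(a: Int, b: Int): Int = { a + b }

  // `factor` is a free variable of the returned function literal;
  // capturing it is what makes the result a closure.
  def multiplier(factor: Int): Int => Int = (x: Int) => x * factor

  // A closure captures the variable itself, so later updates are visible.
  var bonus = 10
  val withBonus: Int => Int = (x: Int) => x + bonus
}
```

Here `multiplier(3)` returns a function that remembers `factor = 3`, and `withBonus` sees the new value if `bonus` is reassigned later.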

6. Inheritance:

- class Subclass extends Superclass

- A class extends only one superclass but can mix in several traits (using with), which gives a form of multiple inheritance.
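
The sketch below combines single-class inheritance with trait mixins; `Animal`, `Swimmer`, `Runner`, and `Dog` are hypothetical names.

```scala
object InheritanceSketch {
  class Animal(val name: String) {
    def sound: String = "..."
    def describe: String = name + " says " + sound
  }

  // Traits can carry concrete method implementations.
  trait Swimmer { def swim: String = "swims" }
  trait Runner { def run: String = "runs" }

  // One superclass via `extends`, any number of traits via `with`.
  class Dog(name: String) extends Animal(name) with Swimmer with Runner {
    override def sound: String = "Woof"
  }
}
```

A `Dog` inherits `describe` from `Animal`, overrides `sound`, and picks up `swim` and `run` from the two traits.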
