Hadoop Ecosystem and YARN
1. Hadoop Ecosystem Components:
- HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper, Ambari.
2. Schedulers:
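- FIFO: First In, First Out. Jobs run in submission order, so a long job can delay every job queued behind it.
- Fair Scheduler: Resources shared among running jobs so that, over time, each gets a roughly equal share.
- Capacity Scheduler: Cluster divided into queues, each guaranteed a configured share of capacity.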
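The difference between the three policies can be sketched with a toy simulation. This is a minimal illustration in plain Python, not YARN's scheduler code: the function names, the job names, and the "one time unit per turn" approximation of fair sharing are all assumptions made for the example.

```python
from collections import deque

def fifo_finish_times(jobs):
    """FIFO: run jobs one at a time, in submission order."""
    clock, finish = 0, {}
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish

def fair_finish_times(jobs):
    """Fair sharing, approximated by giving each active job one time unit in turn.

    Assumes every job needs a positive whole number of work units.
    """
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        clock += 1
        remaining -= 1
        if remaining == 0:
            finish[name] = clock
        else:
            queue.append((name, remaining))
    return finish

def capacity_allocation(total_containers, queue_percent):
    """Capacity: each queue receives its configured percentage of the cluster."""
    return {q: total_containers * pct // 100 for q, pct in queue_percent.items()}

# Hypothetical workload: (job name, units of work), in submission order.
jobs = [("long", 6), ("short", 2)]
fifo = fifo_finish_times(jobs)    # short waits behind long
fair = fair_finish_times(jobs)    # short finishes early, long finishes last
queues = capacity_allocation(20, {"prod": 70, "dev": 30})
```

In this toy workload the short job finishes at time 8 under FIFO but at time 4 under fair sharing, while the long job finishes at 6 and 8 respectively. That is the usual trade-off between the two policies.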
3. Hadoop 2.0 New Features:
- NameNode High Availability: Eliminates single point of failure.
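- HDFS Federation: Supports multiple independent NameNodes, each managing part of the namespace.
- MRv2 (MapReduce Version 2): MapReduce running on YARN; the old JobTracker's two duties, resource management and job scheduling/monitoring, are split into separate components.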
- YARN (Yet Another Resource Negotiator): Resource management layer.
- Running MRv1 in YARN: Backward compatibility with MRv1 applications.
NoSQL Databases
1. Introduction to NoSQL:
- Non-relational, distributed databases, usually with a flexible schema or none at all.
- Types: Key-Value, Document, Column-Family, Graph databases.
MongoDB
1. Introduction:
- Document-oriented NoSQL database.
- Stores data in BSON (Binary JSON) format.
2. Data Types:
- String, Integer, Boolean, Double, Arrays, Objects, Null, Date, ObjectId.
3. Creating, Updating, and Deleting Documents:
- db.collection.insertOne(), insertMany()
- db.collection.updateOne(), updateMany()
- db.collection.deleteOne(), deleteMany()
4. Querying:
- db.collection.find(), findOne()
- Query operators: $gt, $lt, $in, $and, $or, $regex
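To make the meaning of the query operators concrete, the sketch below evaluates a small subset of them against plain Python dictionaries. It is an in-memory illustration only, not MongoDB or its driver: the matches and find helpers and the sample users data are hypothetical, and the sketch treats every dictionary-valued condition as a set of operators, which is a simplification of how MongoDB handles embedded documents.

```python
import re

def matches(doc, query):
    """Return True if doc satisfies query (subset: $gt, $lt, $in, $and, $or, $regex)."""
    for key, cond in query.items():
        if key == "$and":
            if not all(matches(doc, q) for q in cond):
                return False
        elif key == "$or":
            if not any(matches(doc, q) for q in cond):
                return False
        elif isinstance(cond, dict):
            value = doc.get(key)
            for op, arg in cond.items():
                if op == "$gt":
                    ok = value is not None and value > arg
                elif op == "$lt":
                    ok = value is not None and value < arg
                elif op == "$in":
                    ok = value in arg
                elif op == "$regex":
                    ok = isinstance(value, str) and re.search(arg, value) is not None
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
        elif doc.get(key) != cond:
            # A plain value means equality on that field.
            return False
    return True

def find(collection, query):
    """Return every document in the list that matches the query."""
    return [doc for doc in collection if matches(doc, query)]

# Hypothetical sample data.
users = [
    {"name": "Ana", "age": 31, "city": "Lima"},
    {"name": "Bob", "age": 17, "city": "Oslo"},
    {"name": "Cy", "age": 45, "city": "Lima"},
]
adults_under_40 = find(users, {"age": {"$gt": 18, "$lt": 40}})
oslo_or_c = find(users, {"$or": [{"city": "Oslo"}, {"name": {"$regex": "^C"}}]})
```

Several conditions in one query document are combined with an implicit AND, which is why {"age": {"$gt": 18, "$lt": 40}} selects a range.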
5. Introduction to Indexing:
- Improves query performance.
- db.collection.createIndex({field: 1})
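Why an index helps can be sketched by comparing a full scan with a lookup over sorted keys. The code below is a simplified stand-in that uses a sorted list and binary search. It does not reproduce MongoDB's actual index structure, and build_index, range_lookup, and the sample documents are hypothetical names for illustration.

```python
import bisect

def build_index(collection, field):
    """Build an ascending index: sorted keys plus the position of each document."""
    pairs = sorted((doc[field], pos) for pos, doc in enumerate(collection))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions

def range_lookup(collection, index, low, high):
    """Return documents with low <= field <= high using binary search on the index."""
    keys, positions = index
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [collection[pos] for pos in positions[start:end]]

# Hypothetical sample data; the index is ascending, like {field: 1}.
people = [{"age": 31}, {"age": 17}, {"age": 45}, {"age": 22}]
age_index = build_index(people, "age")
in_range = range_lookup(people, age_index, 20, 35)
```

The lookup inspects only the matching slice of the index instead of every document, at the cost of building and maintaining the index on writes.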
6. Capped Collections:
- Fixed-size collections that preserve insertion order and overwrite the oldest documents once full.
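The overwrite behaviour resembles a ring buffer, sketched here with a bounded deque. This is an analogy rather than MongoDB itself: the CappedCollection class is hypothetical, and it caps the number of documents only, whereas a real capped collection is created with a size limit in bytes.

```python
from collections import deque

class CappedCollection:
    """Toy capped collection: keeps only the most recent max_docs documents."""

    def __init__(self, max_docs):
        # A deque with maxlen silently drops the oldest item when full.
        self._docs = deque(maxlen=max_docs)

    def insert_one(self, doc):
        self._docs.append(doc)

    def find(self):
        # Documents come back in insertion order.
        return list(self._docs)

log = CappedCollection(max_docs=3)
for i in range(5):
    log.insert_one({"event": i})
recent = log.find()   # only events 2, 3 and 4 remain
```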
Apache Spark
1. Installing Spark:
- Download binaries, set environment variables, configure Spark.
2. Spark Applications, Jobs, Stages, and Tasks:
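- Application: User program, consisting of a driver and its executors.
- Job: Computation triggered by an action.
- Stage: Set of tasks that can run without a shuffle; a job is split into stages at shuffle boundaries.
- Task: Smallest unit of work; processes one partition.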
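How a job breaks into stages and tasks can be sketched with a toy splitter. It is a rough model, not Spark's scheduler: the WIDE set lists a few shuffle-producing operations by name, every other operation is assumed to be narrow, and split_into_stages and task_count are hypothetical helpers.

```python
# Operations that need a shuffle (wide dependencies); a partial, illustrative list.
WIDE = {"reduceByKey", "groupByKey", "join", "sortByKey"}

def split_into_stages(ops):
    """Split a chain of operation names into stages, cutting before each wide one."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

def task_count(stages, partitions):
    """One task per partition per stage (assumes a constant partition count)."""
    return len(stages) * partitions

# A word-count style pipeline, ending in an action that triggers one job.
pipeline = ["flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(pipeline)
tasks = task_count(stages, partitions=4)
```

Here the single job yields two stages, because reduceByKey needs a shuffle, and with four partitions that gives eight tasks.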
3. Resilient Distributed Datasets (RDDs):
- Immutable, distributed collection of objects.
- Supports transformations (lazy; each returns a new RDD) and actions (trigger computation and return a result).
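The split between transformations and actions can be sketched with a small single-machine class. MiniRDD is a hypothetical teaching aid, not the Spark API: it is neither distributed nor partitioned, and it only shows that transformations return a new immutable object and record work, while an action runs it.

```python
class MiniRDD:
    """Toy RDD: immutable data plus a recorded chain of pending transformations."""

    def __init__(self, data, lineage=()):
        self._data = tuple(data)
        self._lineage = lineage

    def map(self, fn):
        # Transformation: nothing runs yet, a new MiniRDD is returned.
        return MiniRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return MiniRDD(self._data, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded lineage over the data.
        items = iter(self._data)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        # Action: number of elements after all transformations.
        return len(self.collect())

calls = []

def double(x):
    calls.append(x)   # records when the function actually runs
    return x * 2

rdd = MiniRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: nothing has executed
result = rdd.collect()              # the action triggers the computation
```

Because the lineage is kept alongside the data, the result can always be recomputed from the source, which is the same idea Spark relies on to rebuild lost partitions.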
4. Anatomy of a Spark Job Run:
- Driver program creates a SparkContext.
- Transformations are recorded lazily; an action submits a job.
- DAG scheduler splits the job into stages; the TaskScheduler distributes each stage's tasks to executors.
5. Spark on YARN:
- Allows Spark to run on Hadoop YARN for resource management.
Scala
1. Introduction:
- Functional and object-oriented language.
- Runs on the JVM.
2. Classes and Objects:
- Class: Blueprint for objects.
- Object: Instance of a class; a definition using the object keyword creates a singleton.
3. Basic Types and Operators:
- Int, Float, Double, Char, Boolean.
- Operators: +, -, *, /, %, ==, !=, &&, ||
4. Built-in Control Structures:
- if, else, while, for, match (with case clauses)
5. Functions and Closures:
- def functionName(parameters): returnType = {...}
- Closures: Functions that capture free variables from their enclosing scope.
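The idea of a closure is language-independent, so it is sketched here in Python rather than Scala, to match the other examples in these notes. The names make_counter, step, and total are hypothetical. In the inner function, step and total are the free variables captured from the enclosing scope.

```python
def make_counter(step):
    """Return a function that remembers `step` and a running `total`."""
    total = 0

    def counter():
        nonlocal total          # total lives in the enclosing scope
        total += step           # step is captured, not passed as an argument
        return total

    return counter

by_two = make_counter(2)
by_ten = make_counter(10)
first = by_two()
second = by_two()
other = by_ten()    # separate closure with its own captured state
```

Each call to make_counter creates an independent closure, so by_two and by_ten do not share state.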
6. Inheritance:
- class Subclass extends Superclass
- Supports traits: a class extends only one superclass but can mix in several traits, which takes the place of multiple inheritance.