Detailed Exam Notes: Big Data and Hadoop
Unit I: Introduction to Big Data and Hadoop
1. Big Data Analytics:
- Big Data refers to datasets that are too large or complex to process using traditional methods.
- Big Data Analytics involves analyzing such datasets to uncover hidden patterns, correlations, and insights.
- Types of Data:
- Structured: Tabular data with rows and columns (e.g., databases).
- Semi-structured: Data with some structure, like JSON or XML.
- Unstructured: Data with no predefined format (e.g., images, videos, emails).
2. History of Hadoop:
- Hadoop was inspired by Google's MapReduce and GFS (Google File System).
- Doug Cutting and Mike Cafarella created Hadoop (it grew out of the Apache Nutch web-crawler project), and it became an open-source Apache framework.
- Yahoo played a major role in the development and adoption of Hadoop.
3. Hadoop Ecosystem:
- Comprises tools that work together to process and analyze Big Data.
- Core components: HDFS (storage), MapReduce (processing), and YARN (resource management, from Hadoop 2 onward).
- Supporting tools: Hive, Pig, HBase, Sqoop, Flume, Oozie, and Zookeeper.
4. IBM Big Data Strategy:
- IBM InfoSphere BigInsights integrates Hadoop into enterprise environments for better data management.
- Provides advanced tools like text analytics, machine learning, and enterprise-grade security.
Unit II: HDFS (Hadoop Distributed File System)
1. HDFS Concepts:
- HDFS is a distributed storage system designed to store very large datasets across multiple nodes.
- Data is divided into blocks (default size: 128 MB in Hadoop 2.x and later) and stored across a cluster of nodes.
- Features include fault tolerance (each block is replicated, by default three times), high throughput, and scalability; a minimal client sketch follows below.
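The sketch below shows how a Java client could write and read a small file in HDFS through the org.apache.hadoop.fs.FileSystem API. It is a minimal sketch, assuming the Hadoop client libraries are on the classpath; the NameNode address and file path are hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/demo/notes.txt");  // hypothetical path

            // Write a small file; HDFS splits large files into blocks behind the scenes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[1024];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }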
2. Data Ingestion:
- Flume: Used for collecting, aggregating, and moving large amounts of log data into HDFS.
- Sqoop: Transfers data between HDFS and relational databases like MySQL.
3. Hadoop I/O:
- Compression: Reduces data size to save storage and cut disk and network I/O; common codecs include gzip, bzip2, and Snappy.
- Serialization: Converts data into a format that can be stored or transmitted; Hadoop's native mechanism is the Writable interface (see the sketch below), with Avro and Thrift as language-neutral alternatives.
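Below is a minimal sketch of a custom Writable, Hadoop's native serialization interface. The record type and field names (a page-view event with a URL and a hit count) are made up for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record type: a page-view event with a URL and a hit count.
    public class PageViewWritable implements Writable {
        private String url;
        private long hits;

        public PageViewWritable() { }                 // no-arg constructor required by Hadoop

        public PageViewWritable(String url, long hits) {
            this.url = url;
            this.hits = hits;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);    // serialize fields in a fixed order
            out.writeLong(hits);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();   // deserialize in the same order
            hits = in.readLong();
        }

        @Override
        public String toString() {
            return url + "\t" + hits;
        }
    }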
Unit III: MapReduce
1. Anatomy of MapReduce Job:
- The input data is split into smaller chunks (input splits), with one map task per split.
- Each Mapper processes its split in parallel and emits intermediate key-value pairs.
- Each Reducer aggregates the values grouped by key and writes the final output (see the WordCount sketch below).
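The classic WordCount job illustrates this flow. The sketch below uses the standard org.apache.hadoop.mapreduce API; the driver class (job configuration and submission) is omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in a line of input.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after shuffle and sort.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }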
2. Shuffle and Sort:
- Mapper outputs are partitioned by key, copied to the reducers (shuffle), and merged and sorted by key before reduction.
3. Job Scheduling:
- Decides how cluster resources are shared among competing jobs so that tasks are executed efficiently.
- Types of schedulers: FIFO (First In First Out), Fair Scheduler, Capacity Scheduler.
Unit IV: Hadoop Ecosystem Tools
1. Pig:
- High-level scripting platform for data transformation and analysis.
- Uses Pig Latin, a dataflow language that is far more concise than equivalent Java MapReduce code.
- Suitable for tasks like ETL (Extract, Transform, Load); see the embedded-Pig sketch below.
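Pig Latin scripts can be run from the Grunt shell, as standalone scripts, or embedded in Java. The sketch below embeds a tiny ETL-style script through the PigServer API; the file names, field names, and filter condition are assumptions made for illustration.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // A hypothetical ETL-style script: load, filter, and store.
            pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
                    + "AS (ip:chararray, url:chararray, status:int);");
            pig.registerQuery("errors = FILTER logs BY status >= 500;");
            pig.store("errors", "error_logs");   // writes the filtered relation to 'error_logs'

            pig.shutdown();
        }
    }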
2. Hive:
- A data warehouse infrastructure on top of Hadoop.
- HiveQL allows querying data using an SQL-like syntax.
- Stores table data in HDFS and translates queries into distributed jobs (classically MapReduce) for large-scale analysis; see the JDBC sketch below.
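HiveQL can be run from the Hive CLI/Beeline or from Java through the HiveServer2 JDBC driver. In the sketch below the connection URL, credentials, table name, and query are all hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Explicitly register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, and database below are assumptions.
            String url = "jdbc:hive2://localhost:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // A hypothetical aggregation over a 'page_views' table stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");

                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }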
3. HBase:
- A NoSQL database built on top of HDFS for real-time, random read/write access.
- Stores data in a column-family (wide-column) format, which makes point lookups on very large, sparse tables far more practical than in a typical RDBMS; see the client sketch below.
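The sketch below uses the standard HBase Java client to write and read a single cell. The table name, column family, qualifier, and values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Read it back by row key (random, real-time access).
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }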
Unit V: Data Analytics with R and Machine Learning
1. Supervised Learning:
- Models are trained using labeled data (input-output pairs).
- Examples: Regression (predicting numeric values) and Classification (assigning categories); a small regression example follows below.
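As a worked example of regression, the snippet below fits a simple least-squares line y = a + b*x to a few made-up points. It is written in plain Java rather than R, purely to illustrate the idea of learning from labeled input-output pairs.

    public class LinearRegressionSketch {
        public static void main(String[] args) {
            // Hypothetical labeled data: inputs x with known outputs y.
            double[] x = {1, 2, 3, 4, 5};
            double[] y = {2.1, 3.9, 6.2, 8.1, 9.8};
            int n = x.length;

            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
            meanX /= n;
            meanY /= n;

            // Least squares: slope b = cov(x, y) / var(x), intercept a = meanY - b * meanX.
            double cov = 0, var = 0;
            for (int i = 0; i < n; i++) {
                cov += (x[i] - meanX) * (y[i] - meanY);
                var += (x[i] - meanX) * (x[i] - meanX);
            }
            double b = cov / var;
            double a = meanY - b * meanX;

            System.out.printf("y = %.2f + %.2f * x%n", a, b);
            System.out.printf("prediction for x = 6: %.2f%n", a + b * 6);
        }
    }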
2. Unsupervised Learning:
- Works on unlabeled data to identify patterns and relationships.
- Examples: Clustering (grouping similar items; see the k-means sketch below) and Dimensionality Reduction.
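The sketch below runs a minimal k-means clustering on one-dimensional data with k = 2. Again it is plain Java rather than R, and the data points and initial centroids are made up for illustration.

    import java.util.Arrays;

    public class KMeansSketch {
        public static void main(String[] args) {
            double[] points = {1.0, 1.5, 2.0, 8.0, 9.0, 9.5};              // hypothetical data
            double[] centroids = {points[0], points[points.length - 1]};   // initial guesses
            int[] assignment = new int[points.length];

            for (int iter = 0; iter < 10; iter++) {
                // Assignment step: attach each point to the nearest centroid.
                for (int i = 0; i < points.length; i++) {
                    assignment[i] = Math.abs(points[i] - centroids[0])
                            <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
                }
                // Update step: move each centroid to the mean of its assigned points.
                for (int c = 0; c < 2; c++) {
                    double sum = 0;
                    int count = 0;
                    for (int i = 0; i < points.length; i++) {
                        if (assignment[i] == c) { sum += points[i]; count++; }
                    }
                    if (count > 0) centroids[c] = sum / count;
                }
            }
            System.out.println("Centroids: " + Arrays.toString(centroids));
            System.out.println("Assignments: " + Arrays.toString(assignment));
        }
    }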
3. Collaborative Filtering:
- Used in recommender systems (e.g., Amazon, Netflix).
- Suggests relevant items based on user behavior (user-based) or item similarity (item-based); a similarity sketch follows below.
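A common building block of item-based collaborative filtering is the cosine similarity between the rating vectors of two items. The ratings below are made up; 0 stands for "not rated".

    public class ItemSimilaritySketch {
        // Cosine similarity between two rating vectors.
        static double cosine(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Hypothetical ratings from five users for two items.
            double[] itemA = {5, 3, 0, 4, 4};
            double[] itemB = {4, 3, 0, 5, 5};
            System.out.printf("similarity(A, B) = %.3f%n", cosine(itemA, itemB));
        }
    }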