Big Data Final Revision Notes – Exam
Quick Recap
MapReduce
- Programming model with Map and Reduce phases.
- Processes large-scale data in parallel.
- Example: Word Count using key-value pairs.
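The Word Count example can be sketched in plain Python to show the Map, shuffle, and Reduce steps. This is a single-process illustration, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # In a real cluster each line (or split) would be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big deal"]))
```

Each mapper only sees its own line, and each reducer only sees one key, which is what lets the framework run both phases in parallel.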
HDFS Components
- NameNode: Manages metadata.
- DataNode: Stores actual data blocks.
- Secondary NameNode: Periodically merges the edit log into the fsimage (checkpointing); it is not a standby or failover NameNode.
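A rough sketch of the NameNode/DataNode split: the NameNode keeps only a map of which DataNodes hold each block, while the DataNodes hold the bytes. The 128 MB block size and replication factor of 3 are common defaults; the function plan_blocks and the round-robin placement are assumptions for illustration (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # common default replication factor

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica.

    Placement is simple round-robin (hypothetical); it assumes
    replication <= len(datanodes) so that replicas land on distinct nodes.
    """
    n_blocks = math.ceil(file_size / block_size)
    plan = {}
    for b in range(n_blocks):
        plan[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return plan

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
print(plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]))
```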
Apache Pig
- Scripting platform using Pig Latin.
- Easier than raw MapReduce.
- Key commands: LOAD, STORE, JOIN, GROUP.
Apache Hive
- SQL-like tool for querying Hadoop data.
- Metastore stores table schemas and partition metadata.
- Supports partitioning and bucketing; indexing existed in older versions but was removed in Hive 3.0.
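Partitioning and bucketing can be pictured as two ways of deciding where a row is stored. The sketch below assumes a hypothetical sales table; partition_dir and bucket_id are illustrative names, and crc32 is only a stand-in for Hive's own hash function.

```python
import zlib

def partition_dir(table, row, partition_col):
    """Partitioning: one subdirectory per distinct value of the partition column."""
    return f"{table}/{partition_col}={row[partition_col]}"

def bucket_id(value, num_buckets):
    """Bucketing: hash the bucketing column, then take it modulo the bucket count.

    crc32 is a stand-in; Hive uses its own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

row = {"user_id": 42, "country": "IN"}
print(partition_dir("sales", row, "country"))  # sales/country=IN
print(bucket_id(row["user_id"], 4))            # some bucket in 0..3
```

A query filtering on country only needs to read one partition directory, while bucketing on user_id spreads rows evenly across a fixed number of files.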
HiveQL vs SQL
- HiveQL: Used in Hadoop; batch-oriented, higher-latency queries.
- SQL: Used in RDBMS; low-latency, interactive queries and transactions.
ELT vs ETL
- ETL: Extract → Transform → Load.
- ELT: Extract → Load → Transform (used in Big Data).
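The difference in ordering can be sketched with in-memory lists standing in for the source and target systems. All names here (transform, etl, elt, the "raw" and "clean" keys) are hypothetical.

```python
def transform(record):
    """Example cleanup step: trim whitespace and normalise case."""
    return record.strip().lower()

def etl(source, warehouse):
    """ETL: transform in flight, so only cleaned records reach the target."""
    warehouse.extend(transform(r) for r in source)

def elt(source, data_lake):
    """ELT: load raw records first, then transform later inside the target."""
    data_lake["raw"] = list(source)
    data_lake["clean"] = [transform(r) for r in data_lake["raw"]]

source = [" Alice ", "BOB"]

warehouse = []
etl(source, warehouse)
print(warehouse)         # ['alice', 'bob']

lake = {}
elt(source, lake)
print(lake["raw"])       # [' Alice ', 'BOB']
print(lake["clean"])     # ['alice', 'bob']
```

With ELT the raw copy stays available, so a different transformation can be applied later without extracting the data again.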
HBase vs RDBMS
- HBase: NoSQL, column-family-oriented, flexible schema (column families are defined up front; columns can vary per row).
- RDBMS: SQL, row-based, fixed schema.
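A toy model of the HBase layout, assuming a nested dictionary is a fair simplification: each row key maps to "family:qualifier" cells, column families are fixed when the table is created, and qualifiers can differ from row to row. TinyTable is a made-up class, not an HBase client API, and it ignores cell versions and timestamps.

```python
from collections import defaultdict

class TinyTable:
    """Toy HBase-like table: row key -> "family:qualifier" -> value."""

    def __init__(self, families):
        self.families = set(families)   # fixed at table creation
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are free-form, so rows need not share the same columns.
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

table = TinyTable(["info"])
table.put("u1", "info", "name", "Asha")
table.put("u2", "info", "email", "x@example.com")
print(table.get("u1"))   # {'info:name': 'Asha'}
print(table.get("u2"))   # {'info:email': 'x@example.com'}
```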
Data Collection Methods
- Log files, sensors, social media.
- Collected data may be structured, semi-structured (e.g., JSON logs), or unstructured.
Data Masking & PII
- PII: Personally Identifiable Information, e.g., name, Aadhaar number.
- Masking techniques: Substitution, Encryption, Nulling.
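Two of the techniques above can be sketched in a few lines. The helper names (substitute, null_out) and the sample record are hypothetical, and the substitution shown is a simple character-level variant. Encryption is left out because it should rely on a vetted cryptography library rather than hand-written code.

```python
def substitute(value, keep_last=4, mask_char="X"):
    """Substitution: replace all but the last few characters with a mask character."""
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]

def null_out(record, fields):
    """Nulling: return a copy of the record with the listed fields set to None."""
    return {k: (None if k in fields else v) for k, v in record.items()}

record = {"name": "Asha", "id_number": "123456789012", "city": "Pune"}
masked = null_out(record, {"name"})
masked["id_number"] = substitute(masked["id_number"])
print(masked)  # {'name': None, 'id_number': 'XXXXXXXX9012', 'city': 'Pune'}
```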
Big Data in Real Life
- Fraud detection: Banking alerts, unusual patterns.
- Social media: Hashtag trends, customer behavior.