Big Data Final Revision Notes

The document provides a comprehensive overview of key concepts in Big Data, including MapReduce, HDFS components, and tools like Apache Pig and Hive. It distinguishes between HiveQL and SQL, explains ETL vs ELT processes, and compares HBase with RDBMS. Additionally, it discusses data collection methods and the importance of data masking for protecting personally identifiable information (PII).

Uploaded by

dhotreanisha09
Copyright
© All Rights Reserved

Big Data Final Revision Notes – Exam

Quick Recap
MapReduce
- Programming model with Map and Reduce phases.
- Processes large-scale data in parallel.
- Example: Word Count using key-value pairs.
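
The Word Count example can be sketched in plain Python to show how the key-value pairs move through the phases. This is only an in-memory illustration, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions made for the sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls run in parallel on different input splits, and the shuffle moves data between machines.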

HDFS Components
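- NameNode: Manages file system metadata (namespace and file-to-block mapping).
- DataNode: Stores the actual data blocks.
- Secondary NameNode: Periodically merges the edit log into the fsimage (checkpointing); it is not a standby or failover NameNode.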
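
The split between metadata and data blocks can be sketched as follows. The block size (128 MB) and replication factor (3) are the usual Hadoop defaults, but the round-robin placement, the `plan_blocks` helper, and the node names are assumptions for illustration; real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128  # common default block size in Hadoop 2+
REPLICATION = 3      # common default replication factor

def plan_blocks(file_name, size_mb, datanodes):
    """Return NameNode-style metadata: block id -> DataNodes holding a replica."""
    n_blocks = math.ceil(size_mb / BLOCK_SIZE_MB)
    metadata = {}
    for i in range(n_blocks):
        # Simplified round-robin placement (an assumption, not the HDFS policy).
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        metadata[f"{file_name}#blk_{i}"] = replicas
    return metadata

layout = plan_blocks("logs.txt", 300, ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in layout.items():
    print(block_id, nodes)
```

The dictionary plays the role of the NameNode's metadata; the block contents themselves would live only on the listed DataNodes.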

Apache Pig
- Scripting platform using Pig Latin.
- Easier than raw MapReduce.
- Key commands: LOAD, STORE, JOIN, GROUP.
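
What GROUP and JOIN produce can be mimicked in plain Python. This is not Pig Latin, only a rough sketch of the semantics; the `orders` and `cities` relations are hypothetical stand-ins for data that LOAD would read from files.

```python
from collections import defaultdict

# Stand-ins for LOAD: two small in-memory relations (hypothetical data).
orders = [("alice", 30), ("bob", 20), ("alice", 50)]
cities = [("alice", "Pune"), ("bob", "Mumbai")]

# GROUP orders BY user: one bag of tuples per key.
grouped = defaultdict(list)
for user, amount in orders:
    grouped[user].append((user, amount))

# JOIN orders BY user, cities BY user: inner join that keeps fields from both sides.
joined = [
    (user, amount, city_user, city)
    for user, amount in orders
    for city_user, city in cities
    if user == city_user
]

# A per-group aggregate, similar to summing over each bag after GROUP.
totals = {user: sum(amount for _, amount in bag) for user, bag in grouped.items()}

# Stand-in for STORE: print the results instead of writing files.
print(totals)
print(joined)
```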

Apache Hive
- SQL-like tool for querying Hadoop data.
- Metastore stores schema.
- Supports partitioning and bucketing; indexing was available in older versions (removed in Hive 3.0).
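
Partitioning and bucketing both decide where a row is stored. A minimal sketch, assuming a hypothetical `sales` table partitioned by `country` and bucketed by user id into 4 buckets; the CRC32 hash is a stand-in, since Hive uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count for the sketch

def partition_dir(table, country):
    """Partitioning: one directory per partition value (key=value layout)."""
    return f"{table}/country={country}"

def bucket_for(user_id):
    """Bucketing: hash of the bucketing column modulo the bucket count."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

rows = [("u1", "IN"), ("u2", "US"), ("u3", "IN")]
layout = {}
for user_id, country in rows:
    path = f"{partition_dir('sales', country)}/bucket_{bucket_for(user_id)}"
    layout.setdefault(path, []).append(user_id)

for path, users in sorted(layout.items()):
    print(path, users)
```

A query that filters on `country` only needs to read the matching directory, which is the main benefit of partitioning.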

HiveQL vs SQL
- HiveQL: Runs on Hadoop; queries execute as batch jobs, so latency is high.
- SQL: Runs on an RDBMS; suited to low-latency, interactive querying.

ELT vs ETL
- ETL: Extract → Transform → Load.
- ELT: Extract → Load → Transform (used in Big Data).
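
The difference in ordering can be sketched with Python's built-in `sqlite3` module standing in for the target store. The table names and the sample `raw` rows are assumptions for illustration; a Big Data ELT pipeline would load into a data lake or warehouse instead.

```python
import sqlite3

# Extracted source rows: amounts arrive as untidy strings (hypothetical data).
raw = [("alice", " 30 "), ("bob", "20"), ("carol", "n/a")]

def etl(conn):
    # ETL: transform in application code first, then load only the clean rows.
    clean = [(name, int(amount)) for name, amount in raw if amount.strip().isdigit()]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", clean)

def elt(conn):
    # ELT: load the raw strings untouched, then transform inside the store.
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount "
        "FROM sales_raw "
        "WHERE TRIM(amount) <> '' AND TRIM(amount) NOT GLOB '*[^0-9]*'"
    )

conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same clean table, but ELT also keeps the raw rows in the store, so they can be transformed again later in a different way.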

HBase vs RDBMS
- HBase: NoSQL, column-family-oriented, flexible schema (column families are fixed per table; columns within them can vary per row).
- RDBMS: SQL, row-based, fixed schema.
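
The HBase data model can be pictured as nested maps: row key, then column family, then column qualifier. The sketch below is a toy in-memory model, not the HBase client API; the family names and the `put`/`get` helpers are assumptions, and cell versions (timestamps) are left out.

```python
from collections import defaultdict

# Column families are declared when the table is created (assumed names).
FAMILIES = {"info", "stats"}

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, qualifier, value):
    if family not in FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][family][qualifier] = value

def get(row_key, family, qualifier):
    # Returns None when the cell does not exist; no row is created by reading.
    return table.get(row_key, {}).get(family, {}).get(qualifier)

# Rows can carry different columns without any schema change.
put("user1", "info", "name", "Asha")
put("user1", "info", "city", "Pune")
put("user2", "info", "name", "Ravi")
put("user2", "stats", "logins", 7)

print(get("user1", "info", "city"))  # Pune
print(get("user2", "info", "city"))  # None
```

In an RDBMS, by contrast, every row of a table has the same columns, and adding one means altering the table definition.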

Data Collection Methods
- Log files, sensors, social media.
- Structured vs Unstructured data types.

Data Masking & PII
- PII: Data that identifies a person, such as a name or Aadhaar number.
- Masking techniques: Substitution, Encryption, Nulling.
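
Two of these techniques can be sketched briefly. The record, the list of stand-in names, and the helper functions are made up for illustration; encryption is left out because it should rely on a vetted cryptography library rather than hand-written code.

```python
# Hypothetical pool of stand-in values for substitution.
FAKE_NAMES = ["Person A", "Person B", "Person C"]

def substitute_name(name):
    """Substitution: swap the real value for a stand-in from a fixed list."""
    # Deterministic pick, so the same input always maps to the same stand-in.
    index = sum(name.encode("utf-8")) % len(FAKE_NAMES)
    return FAKE_NAMES[index]

def null_out(record, fields):
    """Nulling: replace the listed fields with None."""
    return {key: (None if key in fields else value) for key, value in record.items()}

# Made-up record; the Aadhaar value is a placeholder, not a real number.
record = {"name": "Asha Rao", "aadhaar": "000000000000", "city": "Pune"}

masked = null_out(record, {"aadhaar"})
masked["name"] = substitute_name(record["name"])
print(masked)
```

The original `record` is left unchanged, so masking produces a separate copy that is safe to share.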

Big Data in Real Life
- Fraud detection: Banking alerts, unusual patterns.
- Social media: Hashtag trends, customer behavior.
