Unit 1: Introduction to Big Data
(10-mark answers for each topic)
1. Big Data and Its Importance
• Definition: Big Data refers to datasets that are too large or complex to process using
traditional methods.
• Importance:
o Enables data-driven decision-making.
o Provides predictive insights in fields like healthcare, finance, and marketing.
o Drives innovation and operational efficiency.
• Key Applications:
o Healthcare: Personalized medicine and real-time monitoring.
o Retail: Enhanced customer personalization and inventory management.
2. Characteristics of Big Data (5 V's)
The key properties of Big Data are summarized as:
1. Volume: The size of data, measured in terabytes or petabytes.
2. Velocity: The speed at which data is generated and processed (e.g., social media).
3. Variety: Data in different formats like text, images, videos, etc.
4. Veracity: Ensuring accuracy and reliability of data despite inconsistencies.
5. Value: Deriving meaningful insights to enhance business operations.
3. Big Data Analytics
• Definition: The process of analyzing large and varied datasets to uncover hidden
patterns, correlations, and actionable insights.
• Steps in Big Data Analytics:
1. Data Collection: Gathering structured, semi-structured, and unstructured
data.
2. Storage: Storing data on distributed platforms such as Hadoop (HDFS), with Spark commonly used for processing.
3. Analysis: Employing algorithms for predictive, descriptive, and prescriptive
insights.
• Real-World Example:
o In e-commerce, analytics is used to recommend products based on browsing
history.
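The e-commerce example above can be sketched as a simple co-occurrence count over browsing sessions. This is a minimal, hypothetical illustration (the session data and function names are invented for the example, not a real recommendation engine):

```python
from collections import Counter
from itertools import combinations

# Hypothetical browsing sessions: each list is the products one user viewed.
sessions = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse"],
    ["phone", "charger"],
    ["laptop", "keyboard"],
]

# Descriptive step: count how often each pair of products is viewed together.
pair_counts = Counter()
for session in sessions:
    for a, b in combinations(sorted(set(session)), 2):
        pair_counts[(a, b)] += 1

def recommend(product):
    """Return products most often co-viewed with `product`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [p for p, _ in scores.most_common()]

print(recommend("laptop"))  # mouse and keyboard rank highest
```

Real systems run the same idea at scale over distributed logs, but the descriptive-then-prescriptive flow is the same.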
4. Basic Requirements for Big Data Analytics
1. Hardware Requirements: High-performance servers and storage systems.
2. Frameworks: Tools like Hadoop and Spark for data storage and processing.
3. Scalable Algorithms: Efficient algorithms for handling large datasets.
4. Expertise: Skilled professionals to manage data pipelines.
5. Big Data Applications
1. Healthcare: Disease outbreak prediction and real-time patient monitoring.
2. Finance: Fraud detection and algorithmic trading.
3. Retail: Targeted marketing and demand forecasting.
4. Transportation: Traffic prediction and route optimization.
6. MapReduce Framework
• Definition: A programming model for processing large-scale data in parallel.
• Phases:
1. Map Phase: Breaks data into key-value pairs.
2. Shuffle and Sort: Groups similar keys together.
3. Reduce Phase: Aggregates data to produce the final result.
Diagram: Refer to the MapReduce Workflow.
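The three phases above can be simulated in plain Python. This is an in-memory sketch only (the records and aggregation are illustrative, not Hadoop's actual API), here computing the maximum temperature per city:

```python
from collections import defaultdict

# Illustrative input: (city, temperature) records.
records = [("delhi", 30), ("pune", 25), ("delhi", 35), ("pune", 28)]

# 1. Map phase: emit a key-value pair from each record.
mapped = [(city, temp) for city, temp in records]

# 2. Shuffle and sort: group all values that share a key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# 3. Reduce phase: aggregate each group to a final value (here, the max).
result = {key: max(values) for key, values in groups.items()}
print(result)  # {'delhi': 35, 'pune': 28}
```

In a real cluster, the map and reduce steps run in parallel on different machines, and the shuffle moves data between them over the network.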
7. Algorithms Using MapReduce
• Examples:
1. Word Count: Counts the frequency of each word in a dataset.
2. Sorting: Arranges data in a specific order.
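The word-count example can be written in the same map/shuffle/reduce shape. Again a plain-Python sketch rather than a Hadoop job:

```python
from collections import defaultdict

text = "big data needs big tools"

# Map: emit (word, 1) for every word in the input.
mapped = [(word, 1) for word in text.split()]

# Shuffle and sort: group the 1s by word.
groups = defaultdict(list)
for word, count in sorted(mapped):
    groups[word].append(count)

# Reduce: sum each word's counts.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 1, 'needs': 1, 'tools': 1}
```

Sorting falls out of the same machinery: with an identity map and identity reduce, the shuffle-and-sort phase alone leaves the keys in order.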
8. NoSQL Databases
• Definition: Non-relational databases designed for flexible schemas and horizontal scaling, making them well suited to Big Data.
• Types:
1. Key-Value Databases: Efficient for lookup operations (e.g., Redis).
2. Column-Family Databases: Store data in column families rather than rows (e.g., Cassandra).
3. Document Databases: JSON-like documents (e.g., MongoDB).
4. Graph Databases: Nodes and edges represent relationships (e.g., Neo4j).
Diagram: Refer to the SQL vs NoSQL Comparison.
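As a rough illustration of how the same record fits the key-value and document models above, plain Python dicts can stand in for Redis and MongoDB (no real database client is used; the keys and fields are hypothetical):

```python
import json

# Key-value model (Redis-style): one opaque value per key.
kv_store = {}
kv_store["user:1001"] = json.dumps({"name": "Asha", "city": "Pune"})

# Document model (MongoDB-style): JSON-like documents with queryable fields.
documents = [
    {"_id": 1001, "name": "Asha", "city": "Pune", "orders": [42, 57]},
    {"_id": 1002, "name": "Ravi", "city": "Delhi", "orders": []},
]

# A key-value lookup is a single get by key...
user = json.loads(kv_store["user:1001"])

# ...while documents can be filtered by any field.
pune_users = [d["name"] for d in documents if d["city"] == "Pune"]
print(user["name"], pune_users)
```

The trade-off mirrors the definitions above: key-value stores give very fast lookups but only by key, while document stores allow richer queries over the document's fields.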