Unit 1: Introduction to Big Data
1. Definition of Big Data
Big Data refers to data sets so large and complex that traditional data processing tools cannot
process them efficiently. Working with Big Data involves capturing, storing, managing, and analyzing
these volumes of data to extract valuable insights.
2. Characteristics of Big Data (5Vs)
1. Volume - Large amounts of data (terabytes to zettabytes).
2. Velocity - High speed of data generation (real-time or near-real-time).
3. Variety - Different data formats (text, images, videos, logs, etc.).
4. Veracity - Data quality, accuracy, and trustworthiness.
5. Value - Usefulness of the data in decision-making.
3. Features of Big Data Systems
- Scalability - Systems must scale horizontally to manage large data sets.
- Fault Tolerance - Systems should handle failures gracefully.
- Distributed Storage - Data is stored across multiple machines.
- Parallel Processing - Tasks are executed concurrently across nodes.
- Real-Time Analysis - Insights can be extracted in real time or near real time.
- Cost Efficiency - Uses commodity hardware and open-source tools like Hadoop.
- Flexibility - Supports multiple data formats and sources.
- Data Redundancy - Ensures data availability through replication.
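The distributed storage, fault tolerance, and replication features above can be illustrated with a small sketch. This is a simplified model, not how any particular system places data: the node names, the `place_replicas` and `is_available` helpers, and the replication factor of 3 are all assumptions made for illustration.

```python
import hashlib

def place_replicas(key, nodes, replication_factor=3):
    """Return the nodes that hold copies of `key` (hypothetical placement rule)."""
    # Hash the key to pick a stable starting node, then take the next nodes in order.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]

def is_available(key, nodes, failed, replication_factor=3):
    """A key stays readable while at least one replica is on a healthy node."""
    replicas = place_replicas(key, nodes, replication_factor)
    return any(node not in failed for node in replicas)

# Illustrative cluster of four machines (names are made up).
nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas = place_replicas("block-0001", nodes)

# Losing one replica holder does not make the block unavailable.
one_failure = is_available("block-0001", nodes, failed={replicas[0]})

# Losing every replica holder does.
all_failed = is_available("block-0001", nodes, failed=set(replicas))

print(replicas, one_failure, all_failed)
```

With three copies on three different machines, the block survives the loss of any two of them, which is the sense in which replication provides availability.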
4. Types of Digital Data
- Structured - Data with a fixed schema, organized in rows and columns (e.g., tables in SQL databases).
- Semi-Structured - Partially organized (e.g., XML, JSON).
- Unstructured - No predefined format (e.g., videos, emails, social media posts).
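The difference between the three types shows up in how much work is needed before the data can be queried. The following is a minimal sketch using only the Python standard library; the sample records are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = "id,name,amount\n1,Asha,250\n2,Ravi,120\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but fields can differ per record.
semi_structured = '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "r@example.com"}]'
records = json.loads(semi_structured)

# Unstructured: free text with no schema; any structure must be derived.
unstructured = "Loved the quick delivery, but the packaging was damaged."
tokens = unstructured.lower().replace(",", "").replace(".", "").split()

print(rows[0]["name"])            # a column can be addressed directly by name
print(sorted(records[1].keys()))  # keys must be inspected record by record
print(len(tokens))                # only raw tokens are available at first
```

Structured data can be filtered by column immediately, semi-structured data needs per-record checks for missing fields, and unstructured data needs a processing step (here, naive tokenization) before any analysis.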
5. Traditional vs Big Data Systems
Aspect        Traditional Systems    Big Data Systems
Storage       Centralized            Distributed
Processing    Batch                  Batch and real-time
Data Types    Structured             All types
Scalability   Vertical               Horizontal
Cost          Expensive              Cost-effective
6. Technologies Supporting Big Data
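- Hadoop - Open-source framework for distributed storage (HDFS) and processing.
- MapReduce - Programming model for parallel data processing in map and reduce phases.
- Spark - In-memory processing framework, typically faster than disk-based MapReduce for iterative workloads.
- NoSQL - Non-relational databases (e.g., MongoDB, Cassandra) with flexible data models.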
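The MapReduce programming model can be sketched with the classic word-count example. The code below is a single-process simulation in plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative. In a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different node; here it runs sequentially.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big insights", "data at scale"])
print(counts)
```

Because each map call depends only on its own line and each reduce call only on its own key, both phases can be distributed across machines without coordination beyond the shuffle step.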
7. Applications of Big Data
- Healthcare - Patient analytics, disease prediction.
- Retail - Customer behavior prediction.
- Banking - Fraud detection, risk analysis.
- Government - Smart cities, public safety.
- Social Media - Trend analysis, sentiment mining.