Q1.
Define Big Data & explain its characteristics in detail
Big Data refers to large and complex datasets that cannot be efficiently stored, processed, or analyzed using traditional database management systems or tools. These datasets are characterized by their massive size, high speed of generation, and diverse formats. The concept of Big Data is not just about handling large volumes of data, but also about extracting meaningful insights and patterns to support decision-making.
Characteristics of Big Data (5 Vs):
- Volume: Huge amount of data from multiple sources (e.g., Facebook generates petabytes daily).
- Velocity: Speed of data generation and processing (e.g., stock market transactions).
- Variety: Structured, semi-structured, and unstructured formats (e.g., databases, JSON, videos).
- Veracity: Accuracy and reliability of data (e.g., filtering false data from social media).
- Value: Extracting useful insights for decision making (e.g., personalized recommendations).
Additional Vs sometimes cited: Variability, Visualization.
Importance: Helps in designing systems, ensures accurate analytics, and provides business value.
Q2. Classify the types of data with examples
Types of Data:
1. Structured Data – Tabular, easily stored in an RDBMS (e.g., bank transactions).
2. Semi-Structured Data – Partially organized, self-describing (e.g., XML, JSON, emails).
3. Unstructured Data – No predefined structure (e.g., videos, social media posts).
Other classifications:
- Quantitative vs Qualitative
- Real-time vs Batch Data
Importance: Enables efficient storage, analysis, and better decision-making.
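The structured/semi-structured distinction can be illustrated with a short sketch (hypothetical record, Python standard library only): a JSON document is semi-structured because its fields are tagged but not bound to a fixed schema, and it can be flattened into a fixed row for RDBMS-style storage.

```python
import json

# A semi-structured JSON record: fields are self-describing, but nesting
# and optional keys mean it does not fit a fixed relational schema directly.
raw = '{"user": "alice", "age": 30, "tags": ["sports", "music"]}'

record = json.loads(raw)  # parse into a Python dict

# Flattening into a fixed (structured) row, as an RDBMS table would require:
row = (record["user"], record["age"], ",".join(record["tags"]))
print(row)  # ('alice', 30, 'sports,music')
```

Unstructured data (e.g., a video file) has no such tagged fields at all, which is why it needs specialized processing rather than simple parsing.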
Q3. Differentiate between Traditional Business Intelligence (TBI) & Big Data (BD)
Traditional BI vs Big Data:
- Data type: Structured only vs Structured + Unstructured
- Volume: MB–GB vs TB–PB and beyond
- Velocity: Batch processing vs Real-time
- Tools: RDBMS, OLAP vs Hadoop, Spark, NoSQL
- Flexibility: Rigid schema vs Highly flexible
- Cost: High (proprietary) vs Lower (open-source)
- Use cases: Standard reports vs Predictive analytics, fraud detection
Summary: TBI answers 'what happened', while Big Data answers 'what, why, and what next'.
Q4. Explain architecture of Hadoop environment with neat diagram
Hadoop Architecture:
- HDFS: Distributed storage layer (NameNode holds metadata, DataNodes hold data blocks).
- YARN: Resource management (ResourceManager, NodeManagers).
- MapReduce: Processing model (Map phase, Reduce phase).
- Ecosystem Tools: Hive, Pig, HBase, Sqoop, Flume.
Diagram: [NameNode + DataNodes for storage, YARN for resource allocation, MapReduce/Spark for processing, Hive/Pig for querying].
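The MapReduce model described above can be sketched in plain Python. This is a toy single-machine simulation, not real Hadoop: in a cluster, the map tasks run in parallel on many DataNodes, the framework performs the shuffle over the network, and reduce tasks aggregate the grouped results.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, 1) pairs for every word in an input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big data tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'insights': 1, 'tools': 1}
```

Word count is the classic MapReduce example because each phase is independent: maps never see other lines, and reduces never see other keys, which is what makes the model scale across a cluster.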
Q5. Classify the types of analytics with examples
Types of Analytics:
1. Descriptive – What happened? (e.g., sales reports).
2. Diagnostic – Why did it happen? (e.g., reasons for a sales drop).
3. Predictive – What will happen? (e.g., churn prediction).
4. Prescriptive – What should be done? (e.g., choosing the best marketing strategy).
Summary: Descriptive = Past, Diagnostic = Cause, Predictive = Future, Prescriptive = Action.
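The descriptive/predictive contrast can be made concrete with a tiny numeric sketch (hypothetical monthly sales figures; real predictive analytics would use far richer models than this naive trend line).

```python
# Hypothetical monthly sales figures for the past four months.
sales = [100, 110, 120, 130]

# Descriptive analytics: summarize what happened.
average = sum(sales) / len(sales)  # mean of past sales

# Predictive analytics (naive sketch): extrapolate a linear trend
# to forecast next month's sales.
trend = (sales[-1] - sales[0]) / (len(sales) - 1)  # change per month
forecast = sales[-1] + trend

print(average, forecast)  # 115.0 140.0
```

Diagnostic and prescriptive analytics build on these: diagnostic asks why the trend exists, and prescriptive recommends an action given the forecast.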
Q6. Explain the importance of Big Data analytics for organizations
Importance:
- Better decision making.
- Customer insights and personalization.
- Operational efficiency and cost reduction.
- Real-time fraud detection.
- Risk management and product development.
- Competitive advantage.
Example: Walmart analyzes millions of transactions daily to optimize its supply chain.
Q7. List and briefly explain the technologies used in Big Data environment
Technologies:
- Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, HBase).
- Apache Spark (in-memory processing).
- NoSQL databases (MongoDB, Cassandra, HBase).
- Data ingestion tools (Kafka, Flume, Sqoop).
- Data warehousing (Redshift, BigQuery).
- Visualization (Tableau, Power BI).
- ML frameworks (MLlib, TensorFlow).
- Cloud platforms (AWS, Azure, GCP).
Q8. Explain CAP Theorem
CAP Theorem: A distributed system can guarantee at most two of the following three properties at the same time:
- Consistency: Every read receives the latest write.
- Availability: Every request receives a response.
- Partition Tolerance: The system keeps working despite network failures.
Since network partitions cannot be avoided in practice, the real trade-off during a partition is between consistency and availability.
Examples:
- CP: HBase, MongoDB
- AP: Cassandra, CouchDB
- CA: Rare in distributed systems, as partitions are inevitable.
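The CP-vs-AP trade-off can be sketched with a toy replica class (entirely hypothetical, not modeled on any real database): when the link to the primary is down, a CP-style replica refuses to answer rather than risk staleness, while an AP-style replica answers with whatever value it last saw.

```python
class Replica:
    """Toy read replica illustrating the CP vs AP choice during a partition."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value replicated from the primary
        self.partitioned = False  # True when the link to the primary is down

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: sacrifice availability to avoid serving a possibly stale read.
            raise RuntimeError("unavailable: cannot confirm latest write")
        # AP: stay available, but the value may be stale during a partition.
        return self.value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True  # simulate a network partition

print(ap.read())  # 'v1' — possibly stale, but the request is served
try:
    cp.read()
except RuntimeError as e:
    print(e)      # the CP replica refuses to answer instead
```

Real systems are more nuanced (quorums, tunable consistency levels), but the core dilemma during a partition is exactly this binary choice.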
Q9. What are the terminologies used in Big Data?
Terminologies:
- Data Lake: Central repository for raw data of all types.
- ETL: Extract, Transform, Load process.
- Data Warehouse: Structured, schema-on-write storage for analytics.
- Cluster: Group of machines working together.
- Node: Single machine in a cluster.
- Schema-on-Read: Structure is applied when the data is read, not when it is stored.
- Data Mining: Discovering patterns in large datasets.
- Streaming Data: Continuous flow of data from IoT devices, sensors, etc.
- Machine Learning: Algorithms that learn from data to produce predictive insights.
Q10. Explain NoSQL in Big Data
NoSQL Databases:
Definition: Schema-less, horizontally scalable databases for structured and unstructured data.
Types:
- Document-based (MongoDB, CouchDB).
- Column-based (Cassandra, HBase).
- Key-Value (Redis, DynamoDB).
- Graph (Neo4j).
Advantages: High scalability, flexible schema, fast performance.
Use Cases: Social media analytics, recommendation engines, fraud detection.