
Subject: Big Data Analytics (BCS714D)


MODULE – I
Introduction to BDA

Prepared by
Dr. Bindiya M K
Professor, Dept. of CSE


Module I
Introduction to Big Data
Classification of Digital Data

Digital data can be broadly categorized into three types: structured, semi-structured, and
unstructured data. This classification is based on how well the data conforms to a predefined
schema or data model, which determines how easily it can be stored, processed, and analyzed
by a computer system.

Structured Data

1. Introduction to Structured Data


Structured data refers to highly organized information that is stored and managed in a
predefined schema, usually within Relational Database Management Systems (RDBMS). It is
arranged in tables with rows and columns, making it easily searchable, analyzable, and
accessible using SQL queries.

Structured data is widely used in business applications, where data integrity, accuracy, and
relationships between different entities (such as employees and departments) are critical.

2. Characteristics of Structured Data

Structured data exhibits the following key characteristics:

1. Follows a Predefined Schema

o The structure of the data is defined before storing it.

o Data is stored in tables with specific column names and data types.

2. Organized in Rows and Columns

o Each row represents a unique record or entity (e.g., an employee).

o Each column represents an attribute or field (e.g., Employee Name, Designation).

3. Easily Searchable and Accessible

o SQL queries are used to insert, update, delete, and retrieve data efficiently.

o Indexes and keys speed up searches.

4. Highly Relational

o Relationships exist between different tables through Primary Keys (PK) and
Foreign Keys (FK).

o Data integrity is enforced using constraints like UNIQUE, NOT NULL, and CHECK.

5. Data Consistency and Accuracy


o Constraints ensure valid data entry (e.g., Employee ID should always be unique).

6. Structured Data is Machine-Readable

o Since it follows a fixed format, computers can process it efficiently.

3. Sources of Structured Data

Structured data is typically stored in Relational Database Management Systems (RDBMS), which are widely used across industries for storing transactional and operational data.

Common RDBMS Systems for Structured Data

Popular RDBMS solutions include:

 Oracle (Oracle Corp.)

 IBM DB2 (IBM)

 Microsoft SQL Server (Microsoft)

 Greenplum (EMC)

 Teradata (Teradata Corp.)

 MySQL (Open Source)

 PostgreSQL (Advanced Open Source)

These databases primarily store On-Line Transaction Processing (OLTP) data, which consists
of business transactions generated by daily operations.

Example of Structured Data Sources:

 Employee and Payroll Systems

 Customer Transaction Logs

 Online Banking Systems

 Retail Sales Data

 Product Inventory Data


4. Example of Structured Data - Employee & Department Relationship

In RDBMS, structured data is stored in tables that can be related to each other through
Primary and Foreign Keys.

 Example of Relationship Between Employee and Department Tables:
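The relationship can be sketched in SQL. This is a minimal sketch: the Employee column names follow the INSERT example later in this section, while the Department columns and all data types are illustrative assumptions rather than taken from the original figure:

CREATE TABLE Department (
    DeptNo   VARCHAR(10) PRIMARY KEY,   -- PK: uniquely identifies a department
    DeptName VARCHAR(50) NOT NULL
);

CREATE TABLE Employee (
    EmpNo       VARCHAR(10) PRIMARY KEY,   -- PK: uniquely identifies an employee
    EmpName     VARCHAR(50) NOT NULL,
    Designation VARCHAR(30),
    ContactNo   VARCHAR(15),
    DeptNo      VARCHAR(10),
    FOREIGN KEY (DeptNo) REFERENCES Department(DeptNo)   -- FK: links each employee to a department
);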

Ease of Working with Structured Data

Structured data provides several advantages that make it easy to manage, process, and
analyze.

Data Manipulation Operations (DML)

Structured data supports Data Manipulation Language (DML) operations, which include:

 INSERT: Adding new records.

 UPDATE: Modifying existing records.

 DELETE: Removing records.


 SELECT: Retrieving specific records from a database.

Example of an SQL INSERT Statement:

INSERT INTO Employee (EmpNo, EmpName, Designation, DeptNo, ContactNo)

VALUES ('E103', 'John', 'Manager', 'D2', '0888888888');
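For completeness, minimal examples of the remaining DML operations on the same illustrative Employee table:

UPDATE Employee SET Designation = 'Senior Manager' WHERE EmpNo = 'E103';

DELETE FROM Employee WHERE EmpNo = 'E103';

SELECT EmpName, Designation FROM Employee WHERE DeptNo = 'D2';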

Security

Structured data can be secured using encryption and tokenization techniques. Organizations
ensure data security by:

 Encrypting sensitive information before storage.

 Applying role-based access control (RBAC) to restrict unauthorized access.

 Using database audit logs to track access history.
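A minimal sketch of role-based access control in SQL (the role name analyst is illustrative, and exact privilege syntax varies across RDBMS products):

CREATE ROLE analyst;                           -- define a role for read-only users
GRANT SELECT ON Employee TO analyst;           -- allow the role to query Employee
REVOKE ALL PRIVILEGES ON Employee FROM PUBLIC; -- withhold access from everyone else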

Indexing for Fast Data Retrieval

An index is a data structure that speeds up SELECT queries at the cost of additional storage
space.

Example of Creating an Index on EmpNo:

CREATE INDEX idx_empno ON Employee(EmpNo);

Benefit: Increases the performance of search queries significantly.

Scalability of Structured Data

 Traditional RDBMS can be scaled up by increasing:

o Processor speed (CPU)

o Primary storage (RAM)

o Secondary storage (Hard Drives or SSDs)

However, for extremely large datasets, modern businesses use distributed databases like
Google BigQuery and Apache Hadoop.

Transaction Processing & ACID Properties in RDBMS


RDBMS supports transaction processing using ACID properties to ensure reliability and
consistency.

ACID Properties:

1. Atomicity – A transaction must be all-or-nothing. If one part fails, the entire transaction is rolled back.
Example: Transferring money between two bank accounts should complete both the debit and the credit step together (see the SQL sketch after this list).

2. Consistency – The database remains in a valid state before and after a transaction.
Example: An employee cannot be added to a department that does not exist.

3. Isolation – Transactions are executed independently without interference.
Example: If two users book the last seat in a flight, only one should succeed.

4. Durability – Once a transaction is committed, it is permanently saved.
Example: Even if a power failure occurs, confirmed bookings should not be lost.
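A minimal sketch of the bank-transfer example as one atomic transaction (the Account table and the amount are illustrative, and transaction syntax varies slightly across RDBMS products):

BEGIN TRANSACTION;
UPDATE Account SET Balance = Balance - 500 WHERE AccNo = 'A1';  -- debit step
UPDATE Account SET Balance = Balance + 500 WHERE AccNo = 'A2';  -- credit step
COMMIT;  -- both updates become durable together
-- If either UPDATE fails, ROLLBACK undoes the partial work, preserving atomicity.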

Advantages of Structured Data

1. Efficient Data Storage – A well-organized format reduces redundancy.
2. Faster Query Processing – Indexing ensures quick retrieval.
3. Data Integrity & Accuracy – Enforces validation rules.
4. Security & Access Control – Restricts unauthorized access.
5. Scalability & Maintenance – Can be backed up and restored easily.

6. Real-World Applications:

 ATM Transactions – Logs every withdrawal or deposit.

 E-commerce Sales Data – Tracks customer purchases.

 Airline Booking Systems – Maintains flight schedules.

7. Limitations of Structured Data

1. Not Suitable for Unstructured Data – Cannot store videos, images, or social media posts.
2. Schema Rigidity – Requires modification to add new fields.
3. Limited Scalability – RDBMS may struggle with huge volumes of data.

Solution:

 NoSQL databases like MongoDB store semi-structured and unstructured data.

 Hadoop and Big Data solutions handle massive datasets efficiently.

Semi-Structured Data

1. Introduction to Semi-Structured Data

Semi-structured data is partially organized data that does not conform to the strict tabular
structure of relational databases but still contains some elements of organization and
hierarchy. It is often referred to as self-describing data because it stores both data and
schema together in a flexible format.

Unlike structured data, which follows a predefined schema (e.g., tables in an RDBMS), semi-
structured data contains tags, labels, or key-value pairs to identify fields, making it more
adaptable for diverse data sources.

Example: XML and JSON files, where data is stored in hierarchical formats with tags and
attributes.

2. Characteristics of Semi-Structured Data

1. Does Not Conform to Traditional Relational Data Models

 Unlike relational databases, semi-structured data does not use fixed rows and
columns.

 Example: XML files use nested elements rather than tables.



2. Uses Tags to Segregate Semantic Elements

 Data is organized using tags, key-value pairs, or hierarchical structures.

 Example: XML tags (<title>, <author>) help categorize information.

3. No Clear Separation Between Data and Schema

 Schema information is often embedded within the data itself.

 Example: JSON data includes both field names and values in the same file.

4. Supports Hierarchical Structures

 Semi-structured data uses nested records to establish relationships.

 Example: A JSON object can have nested attributes representing a hierarchical relationship.

5. Flexible Attribute Sets

 Entities in the same dataset may have different attributes or a different order of attributes.

 Example: In JSON, one record might have an "email" field while another might not.

3. Examples of Semi-Structured Data

3.1 Real-World Examples

Where do we find semi-structured data?

 Web pages (HTML, XML) – Contain structured tags but store unstructured content.

 Emails – Contain metadata (structured) and message bodies (unstructured).

 JSON and XML Files – Used in APIs and NoSQL databases.

 Sensor Data (IoT Devices) – Logs stored in flexible JSON formats.

 Social Media Data – Twitter messages with metadata (e.g., hashtags, mentions).


3.2 Common File Formats for Semi-Structured Data

Format | Description | Usage
XML (Extensible Markup Language) | Uses tags to define elements | Web services (SOAP), config files
JSON (JavaScript Object Notation) | Stores data in key-value pairs | Web APIs (REST), NoSQL databases
HTML (HyperText Markup Language) | Defines webpage structure | Websites, blogs
YAML (Yet Another Markup Language) | Human-readable data format | Config files, Kubernetes

4. Sources of Semi-Structured Data

The most common sources of semi-structured data are XML and JSON.

4.1 XML (Extensible Markup Language)

 A markup language that stores data in hierarchical, self-descriptive tags.

 Used in SOAP-based web services, configurations, and document storage.

Example of an XML File:

<Book>

<Title>Fundamentals of Business Analytics</Title>

<Author>Seema Acharya</Author>

<Publisher>Wiley India</Publisher>

<Year>2011</Year>

</Book>


4.2 JSON (JavaScript Object Notation)

 A lightweight data format that stores data in key-value pairs.

 Commonly used in REST APIs, NoSQL databases (MongoDB, CouchDB).

Example of a JSON File:

"BookTitle": "Fundamentals of Business Analytics",

"AuthorName": "Seema Acharya",

"Publisher": "Wiley India",

"YearOfPublication": 2011

5. Comparison: Structured vs. Semi-Structured Data

Feature | Structured Data | Semi-Structured Data
Schema | Predefined | Flexible, embedded in data
Storage | RDBMS (SQL databases) | NoSQL, XML, JSON, Big Data stores
Querying | Uses SQL | Uses XPath (XML), JSONPath (JSON)
Relationships | Defined using Primary and Foreign Keys | Uses nested structures
Examples | Employee databases, bank transactions | Emails, JSON APIs, HTML pages
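The "Querying" row can be made concrete with a small sketch: plain SQL for the structured case, and the SQL:2016 JSON functions (which embed a JSONPath expression) for the semi-structured case. The Book and Books tables, the doc column, and the fields are illustrative assumptions, and support for JSON_VALUE varies by RDBMS:

SELECT Title, Author FROM Book WHERE PubYear = 2011;  -- structured: query relational columns directly

SELECT JSON_VALUE(doc, '$.BookTitle') FROM Books;     -- semi-structured: extract a field via a JSONPath expression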

6. Advantages of Semi-Structured Data

1. More Flexible than Structured Data – Schema can evolve over time.

2. Easier to Store than Unstructured Data – Tags help categorize elements.


3. Self-Describing – Schema and data are stored together.

4. Supports Hierarchical Data – Ideal for complex, nested records.

5. Widely Used in Web Applications – JSON and XML are standard for APIs.

Real-World Use Cases:

 Cloud Storage Services – Google Drive, Dropbox store metadata in JSON/XML.

 E-commerce Product Listings – Product details are stored in flexible JSON files.

 NoSQL Databases (MongoDB, CouchDB) – Use JSON for fast data retrieval.

7. Challenges of Semi-Structured Data

1. Less Efficient than Structured Data – Searching through nested elements is slower.
2. Complex Querying – Requires special tools like XPath, JSONPath.
3. Inconsistent Data Format – Records may not always follow the same structure.

Solution:

 Use NoSQL databases like MongoDB for JSON-based storage.

 Use Big Data frameworks like Hadoop for large-scale XML processing.

1.1.3 Unstructured Data

Unstructured data does not conform to a pre-defined data model. It includes various types of
text and other content formats with unpredictable structures. This type of data constitutes a
significant portion of enterprise data and presents unique challenges in terms of processing
and analysis.

1.1.3.1 Issues with "Unstructured" Data

 Unstructured data lacks a well-defined structure. However, some data categorized as unstructured may still exhibit an implied structure.

 Example: Text files contain metadata (e.g., file name, creation date), but they are
classified as unstructured because analysis focuses primarily on their content rather
than their properties.


1.1.3.2 How to Deal with Unstructured Data?

Approximately 80% of enterprise data is unstructured. Therefore, organizations must adopt various techniques to analyze and extract insights from such data.

1. Data Mining

A set of techniques used to identify patterns and relationships in large datasets using artificial
intelligence, machine learning, statistics, and database systems. Some key algorithms include:

 Association Rule Mining (Market Basket Analysis): Determines frequently co-occurring items (e.g., customers who buy bread are also likely to buy butter).

 Regression Analysis: Predicts the relationship between dependent and independent variables.

 Collaborative Filtering: Predicts a user’s preferences based on the preferences of other similar users (e.g., recommendation systems in e-commerce and streaming platforms).

2. Text Analytics or Text Mining

 Extracts meaningful insights from text data using statistical pattern learning.

 Common tasks include text categorization, clustering, sentiment analysis, and entity extraction.

3. Natural Language Processing (NLP)

 Enables computers to understand and process human language.

 Used in applications like chatbots, voice assistants, machine translation, and text
summarization.

4. Noisy Text Analytics

 Involves processing unstructured data that contains spelling errors, abbreviations, acronyms, missing punctuation, and filler words.

 Common sources include chats, blogs, wikis, emails, and text messages.

5. Manual Tagging with Metadata


 The process of manually assigning metadata to unstructured data for better classification and understanding.

 Helps in organizing and retrieving relevant data efficiently.

6. Part-of-Speech (POS) Tagging

 A linguistic technique that assigns grammatical categories to words in a text (e.g., noun, verb, adjective).

 Essential for syntactic analysis, speech recognition, and machine translation.

7. Unstructured Information Management Architecture (UIMA)

 An open-source framework developed by IBM for real-time content analytics.

 Used for processing text and other unstructured data to extract hidden meanings and
relationships.


2.1 Characteristics of Data

Data has three key characteristics: composition, condition, and context, which define its
structure, usability, and relevance.

Composition of Data

Composition refers to the structure, sources, and nature of data. It can be structured, semi-
structured, or unstructured, originating from databases, sensors, social media, or enterprise
systems. Data granularity varies from aggregated reports to detailed transaction logs. It can
be static (unchanging historical records) or real-time streaming (continuously generated from
IoT devices or social media).

Condition of Data

The condition of data determines its quality and readiness for analysis. Raw data may contain
errors, missing values, or duplicates, requiring cleansing and enrichment before use. High-
quality data ensures accurate insights, whereas poor-quality data can lead to misleading
conclusions.

Context of Data

Context provides insight into where, why, and how data was generated. It answers key
questions about its source, purpose, and sensitivity. For example, customer transaction data
differs in significance from medical records, with varying levels of privacy and security
concerns. Understanding context helps ensure ethical and accurate data interpretation.

Small Data vs. Big Data

Small data is well-structured, with known sources and minimal complexity, making it easy to
analyze. Big data, on the other hand, involves high volume, velocity, and variety, often from
multiple unknown sources. It requires advanced techniques such as machine learning and
distributed computing to process effectively.


2.3 Definition of Big Data


Big data is high-volume, high-velocity, and high-variety information that requires cost-
effective, innovative processing methods to derive valuable insights for decision-making.

Understanding the Definition

1. High-Volume, High-Velocity, and High-Variety

 Big data consists of massive datasets from multiple sources, including structured, semi-structured, and unstructured data.

 It is generated at high speed, requiring real-time or near-real-time processing.

 The variety of data includes text, images, videos, social media posts, and
sensor data.

2. Cost-Effective and Innovative Processing



 Traditional databases are insufficient for handling big data.


 Organizations adopt distributed computing, cloud storage, and
advanced analytics to process and store large datasets efficiently.

3. Enhanced Insight and Decision-Making

 Big data analytics enables organizations to extract meaningful patterns, trends, and predictions.

 These insights support faster, data-driven decisions, leading to competitive advantages and business growth.


The 3Vs of Big Data


The 3Vs concept (Volume, Velocity, and Variety) was introduced by Doug Laney in 2001
and is widely used to define big data challenges and opportunities.

2.4 Challenges with Big Data

Despite its potential, big data presents several challenges that organizations must address.

1. Exponential Growth of Data

 The majority of today's data has been generated in the last few years, and its growth
is accelerating.

 Key questions include:

o Which data is useful for analysis?

o Should all data be processed or only a subset?


o How can valuable insights be separated from noise?

2. Cloud Computing and Virtualization

 Cloud computing offers cost efficiency, scalability, and flexibility for big data
storage and processing.

 However, businesses must decide whether to store data on-premises or in the cloud,
considering security and compliance concerns.

3. Data Retention and Relevance

 Organizations must determine how long to retain data.

 Some data holds long-term value, while others become obsolete within hours.

4. Shortage of Skilled Professionals

 There is a high demand for data science professionals who can manage, analyze,
and interpret big data.

 A lack of expertise in machine learning, artificial intelligence, and data analytics remains a key challenge.

5. Security and Privacy Concerns

 The risk of data breaches and privacy violations is increasing.

 Organizations must implement strong security measures to protect sensitive information.

6. Challenges in Data Processing and Visualization

 Handling big data requires efficient methods for:

o Data Capture, Storage, and Processing

o Search, Analysis, and Transfer

o Security and Visualization

 Traditional database systems struggle to handle big data, necessitating new processing frameworks like Hadoop, Spark, and real-time analytics tools.


7. Need for Effective Data Visualization

 Visualizing large datasets is essential for extracting insights.

 However, there is a shortage of data visualization experts who can present complex
data in an understandable way.

2.5 What is Big Data?

Big data refers to large, complex datasets that are characterized by high volume, high
velocity, and high variety. These three attributes define how data is generated, processed, and
utilized in modern data-driven environments.

2.5.1 Volume


The volume of data has expanded exponentially, from kilobytes (KB) and megabytes (MB) to
petabytes (PB), exabytes (EB), and beyond. The sheer scale of data today requires advanced
storage and processing systems to manage it effectively.

Sources of Big Data

Big data originates from a variety of sources, both internal and external to an organization:

1. Internal Data Sources (Within an organization)

Data Storage: Traditional file systems, SQL databases (Oracle, MySQL, PostgreSQL), and
NoSQL databases (MongoDB, Cassandra).

Archives: Scanned documents, customer records, health records, student data, and
organizational reports.

2. External Data Sources (Outside an organization)


Public Web: Wikipedia, government census data, weather reports, compliance records.

3. Combined Internal & External Sources

Sensor Data: Data from IoT devices, smart meters, car sensors, industrial equipment.

Machine Log Data: Event logs, application logs, business process logs, audit logs, and
clickstream data (user activity on websites).

Social Media: Twitter, Facebook, LinkedIn, Instagram, YouTube, and blogs.

Business Applications: Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), HR systems, and Google Docs.

Media Files: Images, videos, podcasts, and audio recordings.

Documents: PDF, Excel, Word, CSV, and PowerPoint files.

2.5.2 Velocity

Data is now processed at unprecedented speeds, transitioning from traditional batch processing to real-time analytics. This evolution can be categorized as:

Batch Processing: Data is collected and processed at scheduled intervals (e.g., payroll
processing).

Periodic Processing: Data updates occur at fixed timeframes (e.g., bank transactions at the
end of a business day).

Near Real-Time Processing: Data is processed within seconds or minutes of generation (e.g.,
stock market updates).

Real-Time Processing: Continuous data flow is processed instantaneously (e.g., fraud detection, live traffic updates).

2.5.3 Variety

Big data includes a wide range of data types, classified into three main categories:

Structured Data: Highly organized and stored in relational databases (e.g., financial
transactions, customer records).


Semi-Structured Data: Data with some structure but not as rigid as relational databases (e.g.,
XML, JSON, HTML).

Unstructured Data: Data without a predefined structure, making it harder to store and analyze
(e.g., emails, videos, images, social media posts, PDFs).

2.6 Other Characteristics of Data (Beyond the 3Vs)

In addition to Volume, Velocity, and Variety, data exhibits other important characteristics:

1. Veracity and Validity

Veracity: Refers to the accuracy and reliability of data. Not all collected data is meaningful or
relevant for analysis.

Validity: Ensures that data is correct, clean, and suitable for decision-making.

2. Volatility

Some data remains relevant for long periods (e.g., customer purchase history).

Other data becomes obsolete quickly (e.g., real-time social media trends).

Organizations must define how long to retain data before it loses value.

3. Variability

Data generation is not uniform; fluctuations occur based on business trends.

Example: Retail businesses experience high traffic during festive sales, followed by a slump
in demand.

Systems must be designed to handle these dynamic changes in data flow.

2.7 Why Big Data?


The more data available, the better the accuracy of analytical insights. Organizations leverage
big data for:

1. Improved Decision-Making

Data-driven insights lead to greater confidence in strategic decisions.

Example: Predictive analytics helps businesses anticipate customer demand.

2. Operational Efficiency

Big data analytics helps optimize processes, reduce waste, and enhance efficiency.

Example: Manufacturing industries use big data to minimize downtime through predictive
maintenance.

3. Cost and Time Reduction

Advanced analytics automates processes, reducing operational costs and time.

Example: Retailers use AI-driven supply chain optimization to reduce excess inventory.

4. Innovation in Products and Services

Insights from big data fuel the development of new products and services.

Example: Streaming services (Netflix, Spotify) use data to recommend personalized content.

5. Enhanced Customer Experience

Big data helps analyze customer behavior and deliver personalized experiences.


Example: E-commerce platforms use customer data to suggest relevant products.

2.9 Traditional Business Intelligence (BI) vs. Big Data

Key Differences

1. Data Storage & Scalability

Traditional BI: Data is stored in a centralized server (single machine or cluster).

Big Data: Data is stored in a distributed file system (e.g., Hadoop Distributed File System -
HDFS). It scales horizontally (adding more machines), whereas traditional BI scales
vertically (adding more power to an existing machine).

2. Data Processing Mode

Traditional BI: Data analysis is typically performed in offline mode.

Big Data: Supports both real-time and offline analysis.

3. Data Type & Processing

Traditional BI: Works with structured data and moves data to processing functions (move
data to code).

Big Data: Works with structured, semi-structured, and unstructured data and moves
processing functions to data (move code to data).

2.10 A Typical Data Warehouse (DW) Environment


Data Sources

 Enterprise Resource Planning (ERP) Systems (e.g., SAP, Oracle ERP)
 Customer Relationship Management (CRM) Systems (e.g., Salesforce)
 Legacy Systems
 Third-Party Applications (e.g., external data providers)
 File Formats: RDBMS (Oracle, SQL Server, DB2), Excel (.xls, .xlsx), CSV, text files

Data Integration Process (ETL - Extract, Transform, Load)

 Extraction – Collecting data from multiple sources.
 Transformation – Cleaning, standardizing, and structuring data.
 Loading – Storing the transformed data in a data warehouse (enterprise level) or data marts (business unit level). A minimal SQL sketch of the transform-and-load step follows.
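A hedged sketch of the Transform and Load steps expressed as SQL, assuming an illustrative staging table stg_sales and warehouse table fact_sales (production pipelines typically use dedicated ETL tools):

INSERT INTO fact_sales (sale_id, sale_date, amount)  -- Load: write into the warehouse table
SELECT sale_id,
       CAST(sale_date AS DATE),                      -- Transform: standardize the date type
       ROUND(amount, 2)                              -- Transform: normalize precision
FROM stg_sales
WHERE amount IS NOT NULL;                            -- Cleanse: skip incomplete records

A scheduler would run this statement at each batch interval, after the Extraction step lands fresh rows in stg_sales.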

Decision-Making Tools

SQL queries, dashboards, and data mining tools are used for business intelligence and
analytics.

2.11 A Typical Hadoop Environment

Key Differences from Data Warehouse


 Diverse Data Sources: Web logs, social media, documents (PDFs, text files),
multimedia (audio, video).
 Data Location: Data comes not only from within the company but also from external
sources (e.g., social media, IoT devices).
 Storage: Hadoop Distributed File System (HDFS) for large-scale data storage.
 Processing: Uses MapReduce for distributed data processing.
