BCS714D – Big Data Analytics
Module 1 Notes
Prepared by
Dr. Bindiya M K
Professor, Dept. of CSE
Module I
Introduction to Big Data
Classification of Digital Data
Digital data can be broadly categorized into three types: structured, semi-structured, and
unstructured data. This classification is based on how well the data conforms to a predefined
schema or data model, which determines how easily it can be stored, processed, and analyzed
by a computer system.
Structured Data
Structured data refers to highly organized information that is stored and managed in a
predefined schema, usually within Relational Database Management Systems (RDBMS). It is
arranged in tables with rows and columns, making it easily searchable, analyzable, and
accessible using SQL queries.
Structured data is widely used in business applications, where data integrity, accuracy, and
relationships between different entities (such as employees and departments) are critical.
1. Predefined Schema
o Data is stored in tables with specific column names and data types.
2. SQL Support
o SQL queries are used to insert, update, delete, and retrieve data efficiently.
3. Highly Relational
o Relationships exist between different tables through Primary Keys (PK) and
Foreign Keys (FK).
RDBMS products such as Greenplum (EMC) primarily store On-Line Transaction
Processing (OLTP) data, which consists of business transactions generated by daily
operations.
In RDBMS, structured data is stored in tables that can be related to each other through
Primary and Foreign Keys.
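The table structure, PK/FK relationship, and indexing described above can be sketched with Python's built-in sqlite3 module. The department/employee tables and column names are illustrative choices, not taken from the notes:

```python
import sqlite3

# In-memory relational database; schema is defined before any data is stored
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma

conn.execute("""CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    dept_id  INTEGER REFERENCES department(dept_id))""")  # FK to department

# DML operations: INSERT rows, then SELECT across the PK/FK relationship
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO employee VALUES (101, 'Asha', 1)")

# An index speeds up lookups on dept_id at the cost of extra storage
conn.execute("CREATE INDEX idx_emp_dept ON employee(dept_id)")

row = conn.execute("""SELECT e.emp_name, d.dept_name
                      FROM employee e JOIN department d
                        ON e.dept_id = d.dept_id""").fetchone()
print(row)  # ('Asha', 'CSE')
```

The JOIN works only because both tables follow a predefined schema; this is the "highly relational" property in action.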
Structured data provides several advantages that make it easy to manage, process, and
analyze.
Structured data supports Data Manipulation Language (DML) operations, which include
SELECT, INSERT, UPDATE, and DELETE.
Security
Structured data can be secured using encryption and tokenization techniques, and by
restricting access through database user roles and privileges.
An index is a data structure that speeds up SELECT queries at the cost of additional storage
space.
However, for extremely large datasets, modern businesses use distributed databases like
Google BigQuery and Apache Hadoop.
RDBMS supports transaction processing using ACID properties to ensure reliability and
consistency.
ACID Properties:
1. Atomicity – A transaction is all-or-nothing; either every operation completes or none do.
2. Consistency – The database remains in a valid state before and after a transaction.
Example: An employee cannot be added to a department that does not exist.
3. Isolation – Concurrent transactions do not interfere with each other's intermediate states.
4. Durability – Once committed, a transaction's changes survive system failures.
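The consistency example above can be demonstrated with sqlite3: a foreign-key violation aborts the transaction, and the database rolls back to its previous valid state. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id))""")
conn.execute("INSERT INTO department VALUES (1)")

try:
    # Atomic transaction: try to add an employee to department 99, which does not exist
    with conn:
        conn.execute("INSERT INTO employee VALUES (101, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the FK constraint blocks the insert

# Consistency preserved: the invalid row was never committed
print(conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0])  # 0
```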
Limitations of Structured Data:
1. Not Suitable for Unstructured Data – Cannot store videos, images, or social media posts.
2. Schema Rigidity – Requires schema modification to add new fields.
3. Limited Scalability – RDBMS may struggle with huge volumes of data.
Solution: Use distributed platforms such as Apache Hadoop or Google BigQuery, which
scale horizontally across many machines.
Semi-Structured Data
Semi-structured data is partially organized data that does not conform to the strict tabular
structure of relational databases but still contains some elements of organization and
hierarchy. It is often referred to as self-describing data because it stores both data and
schema together in a flexible format.
Unlike structured data, which follows a predefined schema (e.g., tables in an RDBMS), semi-
structured data contains tags, labels, or key-value pairs to identify fields, making it more
adaptable for diverse data sources.
Example: XML and JSON files, where data is stored in hierarchical formats with tags and
attributes.
Unlike relational databases, semi-structured data does not use fixed rows and
columns.
Example: JSON data includes both field names and values in the same file.
Example: In JSON, one record might have an "email" field while another
might not.
Web pages (HTML, XML) – Contain structured tags but store unstructured content.
Social Media Data – Twitter messages with metadata (e.g., hashtags, mentions).
The most common formats of semi-structured data are XML and JSON:
XML (Extensible Markup Language) – Uses tags to define elements. Used in web services
(SOAP) and configuration files.
JSON (JavaScript Object Notation) – Stores data in key-value pairs. Used in Web APIs
(REST) and NoSQL databases.
<Book>
<Author>Seema Acharya</Author>
<Publisher>Wiley India</Publisher>
<Year>2011</Year>
</Book>
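The XML book record above is self-describing: each field is located by its tag name rather than by a fixed column position. A quick sketch of parsing it with Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

xml_doc = """<Book>
  <Author>Seema Acharya</Author>
  <Publisher>Wiley India</Publisher>
  <Year>2011</Year>
</Book>"""

book = ET.fromstring(xml_doc)
# Tags act as the schema: fields are found by name, not by column position
print(book.find("Author").text)  # Seema Acharya
print(book.find("Year").text)    # '2011' -- returned as text; XML does not enforce types
```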
The same record represented in JSON:
{
"Book": {
"Author": "Seema Acharya",
"Publisher": "Wiley India",
"YearOfPublication": 2011
}
}
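A JSON record equivalent to the XML book example can be parsed with Python's json module; the nesting below mirrors the XML structure and is an illustrative reconstruction:

```python
import json

json_doc = """{
  "Book": {
    "Author": "Seema Acharya",
    "Publisher": "Wiley India",
    "YearOfPublication": 2011
  }
}"""

record = json.loads(json_doc)
print(record["Book"]["Author"])             # Seema Acharya
print(record["Book"]["YearOfPublication"])  # 2011 -- parsed as an int, unlike XML text
# Another record may omit fields entirely; .get() tolerates the missing key
other = json.loads('{"Book": {"Author": "Unknown"}}')
print(other["Book"].get("YearOfPublication"))  # None
```

This shows why semi-structured records need not share an identical schema: one record carries a field that another lacks.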
Relationships: Structured data defines them using Primary and Foreign Keys; semi-structured
data uses nested structures.
Examples: Structured – employee databases, bank transactions; semi-structured – emails,
JSON APIs, HTML pages.
1. More Flexible than Structured Data – Schema can evolve over time.
5. Widely Used in Web Applications – JSON and XML are standard for APIs.
E-commerce Product Listings – Product details are stored in flexible JSON files.
NoSQL Databases (MongoDB, CouchDB) – Use JSON for fast data retrieval.
1. Less Efficient than Structured Data – Searching through nested elements is slower.
2. Complex Querying – Requires special tools like XPath, JSONPath.
3. Inconsistent Data Format – Records may not always follow the same structure.
Solution:
Use Big Data frameworks like Hadoop for large-scale XML processing.
Unstructured Data
Unstructured data does not conform to a pre-defined data model. It includes various types of
text and other content formats with unpredictable structures. This type of data constitutes a
significant portion of enterprise data and presents unique challenges in terms of processing
and analysis.
Example: Text files contain metadata (e.g., file name, creation date), but they are
classified as unstructured because analysis focuses primarily on their content rather
than their properties.
1. Data Mining
A set of techniques used to identify patterns and relationships in large datasets using artificial
intelligence, machine learning, statistics, and database systems. Key techniques include
classification, clustering, regression, and association rule mining.
2. Text Analytics and Natural Language Processing (NLP)
Extracts meaningful insights from text data using statistical pattern learning.
Used in applications like chatbots, voice assistants, machine translation, and text
summarization.
Common sources include chats, blogs, wikis, emails, and text messages.
These techniques are used for processing text and other unstructured data to extract hidden
meanings and relationships.
Data has three key characteristics: composition, condition, and context, which define its
structure, usability, and relevance.
Composition of Data
Composition refers to the structure, sources, and nature of data. It can be structured, semi-
structured, or unstructured, originating from databases, sensors, social media, or enterprise
systems. Data granularity varies from aggregated reports to detailed transaction logs. It can
be static (unchanging historical records) or real-time streaming (continuously generated from
IoT devices or social media).
Condition of Data
The condition of data determines its quality and readiness for analysis. Raw data may contain
errors, missing values, or duplicates, requiring cleansing and enrichment before use. High-
quality data ensures accurate insights, whereas poor-quality data can lead to misleading
conclusions.
Context of Data
Context provides insight into where, why, and how data was generated. It answers key
questions about its source, purpose, and sensitivity. For example, customer transaction data
differs in significance from medical records, with varying levels of privacy and security
concerns. Understanding context helps ensure ethical and accurate data interpretation.
Small data is well-structured, with known sources and minimal complexity, making it easy to
analyze. Big data, on the other hand, involves high volume, velocity, and variety, often from
multiple unknown sources. It requires advanced techniques such as machine learning and
distributed computing to process effectively.
Big data is high-volume, high-velocity, and high-variety information that requires cost-
effective, innovative processing methods to derive valuable insights for decision-making.
The variety of data includes text, images, videos, social media posts, and
sensor data.
The 3Vs concept (Volume, Velocity, and Variety) was introduced by Doug Laney in 2001
and is widely used to define big data challenges and opportunities.
Despite its potential, big data presents several challenges that organizations must address.
The majority of today's data has been generated in the last few years, and its growth
is accelerating.
Cloud computing offers cost efficiency, scalability, and flexibility for big data
storage and processing.
However, businesses must decide whether to store data on-premises or in the cloud,
considering security and compliance concerns.
Some data holds long-term value, while other data becomes obsolete within hours.
There is a high demand for data science professionals who can manage, analyze,
and interpret big data.
However, there is a shortage of data visualization experts who can present complex
data in an understandable way.
Big data refers to large, complex datasets that are characterized by high volume, high
velocity, and high variety. These three attributes define how data is generated, processed, and
utilized in modern data-driven environments.
2.5.1 Volume
The volume of data has expanded exponentially, from kilobytes (KB) and megabytes (MB) to
petabytes (PB), exabytes (EB), and beyond. The sheer scale of data today requires advanced
storage and processing systems to manage it effectively.
Big data originates from a variety of sources, both internal and external to an organization:
Data Storage: Traditional file systems, SQL databases (Oracle, MySQL, PostgreSQL), and
NoSQL databases (MongoDB, Cassandra).
Archives: Scanned documents, customer records, health records, student data, and
organizational reports.
Public Web: Wikipedia, government census data, weather reports, compliance records.
Sensor Data: Data from IoT devices, smart meters, car sensors, industrial equipment.
Machine Log Data: Event logs, application logs, business process logs, audit logs, and
clickstream data (user activity on websites).
2.5.2 Velocity
Velocity refers to the speed at which data is generated and processed. Data handling has
evolved through several stages:
Batch Processing: Data is collected and processed at scheduled intervals (e.g., payroll
processing).
Periodic Processing: Data updates occur at fixed timeframes (e.g., bank transactions at the
end of a business day).
Near Real-Time Processing: Data is processed within seconds or minutes of generation (e.g.,
stock market updates).
Real-Time Processing: Data is processed as soon as it is generated, enabling immediate
responses (e.g., fraud detection).
2.5.3 Variety
Big data includes a wide range of data types, classified into three main categories:
Structured Data: Highly organized and stored in relational databases (e.g., financial
transactions, customer records).
Semi-Structured Data: Data with some structure but not as rigid as relational databases (e.g.,
XML, JSON, HTML).
Unstructured Data: Data without a predefined structure, making it harder to store and analyze
(e.g., emails, videos, images, social media posts, PDFs).
In addition to Volume, Velocity, and Variety, data exhibits other important characteristics:
1. Veracity and Validity
Veracity refers to the accuracy and reliability of data; not all collected data is meaningful or
relevant for analysis.
Validity ensures that data is correct, clean, and suitable for decision-making.
2. Volatility
Some data remains relevant for long periods (e.g., customer purchase history).
Other data becomes obsolete quickly (e.g., real-time social media trends).
Organizations must define how long to retain data before it loses value.
3. Variability
Variability refers to inconsistencies in the speed and flow of data, which can peak and dip
unpredictably.
Example: Retail businesses experience high traffic during festive sales, followed by a slump
in demand.
The more data available, the better the accuracy of analytical insights. Organizations leverage
big data for:
1. Improved Decision-Making
Data-driven insights enable faster and more accurate business decisions.
2. Operational Efficiency
Big data analytics helps optimize processes, reduce waste, and enhance efficiency.
Example: Manufacturing industries use big data to minimize downtime through predictive
maintenance.
Example: Retailers use AI-driven supply chain optimization to reduce excess inventory.
3. Innovation
Insights from big data fuel the development of new products and services.
Example: Streaming services (Netflix, Spotify) use data to recommend personalized content.
4. Enhanced Customer Experience
Big data helps analyze customer behavior and deliver personalized experiences.
Key Differences
Big Data: Data is stored in a distributed file system (e.g., Hadoop Distributed File System -
HDFS). It scales horizontally (adding more machines), whereas traditional BI scales
vertically (adding more power to an existing machine).
Traditional BI: Works with structured data and moves data to processing functions (move
data to code).
Big Data: Works with structured, semi-structured, and unstructured data and moves
processing functions to data (move code to data).
Decision-Making Tools
SQL queries, dashboards, and data mining tools are used for business intelligence and
analytics.
Diverse Data Sources: Web logs, social media, documents (PDFs, text files),
multimedia (audio, video).
Data Location: Data comes not only from within the company but also from external
sources (e.g., social media, IoT devices).
Storage: Hadoop Distributed File System (HDFS) for large-scale data storage.
Processing: Uses MapReduce for distributed data processing.
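MapReduce in Hadoop runs distributed across a cluster; as a rough single-machine sketch of the model it uses (map emits key-value pairs, a shuffle groups them by key, reduce aggregates each group), here is a word count, the classic MapReduce example:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this across machines)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; here, sum the counts per word
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big storage", "data moves code to data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

This illustrates "move code to data": in a real cluster each machine runs map_phase locally on its own block of the file, and only the small (word, count) pairs travel over the network.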