BCS714D – Big Data Analytics
Module 1 Notes
Prepared by
Dr. Bindiya M K
Professor, Dept. of CSE
Module I
Introduction to Big Data
Classification of Digital Data
Digital data can be broadly categorized into three types: structured, semi-structured, and
unstructured data. This classification is based on how well the data conforms to a predefined
schema or data model, which determines how easily it can be stored, processed, and analyzed
by a computer system.
Structured Data
Structured data refers to highly organized information that is stored and managed in a
predefined schema, usually within Relational Database Management Systems (RDBMS). It is
arranged in tables with rows and columns, making it easily searchable, analyzable, and
accessible using SQL queries.
Structured data is widely used in business applications, where data integrity, accuracy, and
relationships between different entities (such as employees and departments) are critical.
1. Predefined Schema
o Data is stored in tables with specific column names and data types.
2. SQL Support
o SQL queries are used to insert, update, delete, and retrieve data efficiently.
3. Highly Relational
o Relationships exist between different tables through Primary Keys (PK) and
Foreign Keys (FK).
RDBMS products such as Greenplum (EMC) primarily store On-Line Transaction
Processing (OLTP) data, which consists of business transactions generated by daily
operations.
In RDBMS, structured data is stored in tables that can be related to each other through
Primary and Foreign Keys.
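The table structure, PK/FK relationship, and indexing described above can be sketched with Python's built-in sqlite3 module. The department/employee tables and column names are illustrative choices, not taken from the notes:

```python
import sqlite3

# In-memory relational database; schema is defined before any data is stored
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma

conn.execute("""CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    dept_id  INTEGER REFERENCES department(dept_id))""")  # FK to department

# DML operations: INSERT rows, then SELECT across the PK/FK relationship
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO employee VALUES (101, 'Asha', 1)")

# An index speeds up lookups on dept_id at the cost of extra storage
conn.execute("CREATE INDEX idx_emp_dept ON employee(dept_id)")

row = conn.execute("""SELECT e.emp_name, d.dept_name
                      FROM employee e JOIN department d
                        ON e.dept_id = d.dept_id""").fetchone()
print(row)  # ('Asha', 'CSE')
```

The JOIN works only because both tables follow a predefined schema; this is the "highly relational" property in action.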
Structured data provides several advantages that make it easy to manage, process, and
analyze.
Structured data supports Data Manipulation Language (DML) operations, which include
SELECT, INSERT, UPDATE, and DELETE.
Security
Structured data can be secured using encryption and tokenization techniques, and by
restricting access through database user roles and privileges.
An index is a data structure that speeds up SELECT queries at the cost of additional storage
space.
However, for extremely large datasets, modern businesses use distributed databases like
Google BigQuery and Apache Hadoop.
RDBMS supports transaction processing using ACID properties to ensure reliability and
consistency.
ACID Properties:
1. Atomicity – A transaction is all-or-nothing; either every operation completes or none do.
2. Consistency – The database remains in a valid state before and after a transaction.
Example: An employee cannot be added to a department that does not exist.
3. Isolation – Concurrent transactions do not interfere with each other's intermediate states.
4. Durability – Once committed, a transaction's changes survive system failures.
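The consistency example above can be demonstrated with sqlite3: a foreign-key violation aborts the transaction, and the database rolls back to its previous valid state. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id))""")
conn.execute("INSERT INTO department VALUES (1)")

try:
    # Atomic transaction: try to add an employee to department 99, which does not exist
    with conn:
        conn.execute("INSERT INTO employee VALUES (101, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the FK constraint blocks the insert

# Consistency preserved: the invalid row was never committed
print(conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0])  # 0
```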
Limitations of Structured Data:
1. Not Suitable for Unstructured Data – Cannot store videos, images, or social media posts.
2. Schema Rigidity – Requires schema modification to add new fields.
3. Limited Scalability – RDBMS may struggle with huge volumes of data.
Solution: Use distributed platforms such as Apache Hadoop or Google BigQuery, which
scale horizontally across many machines.
Semi-Structured Data
Semi-structured data is partially organized data that does not conform to the strict tabular
structure of relational databases but still contains some elements of organization and
hierarchy. It is often referred to as self-describing data because it stores both data and
schema together in a flexible format.
Unlike structured data, which follows a predefined schema (e.g., tables in an RDBMS), semi-
structured data contains tags, labels, or key-value pairs to identify fields, making it more
adaptable for diverse data sources.
Example: XML and JSON files, where data is stored in hierarchical formats with tags and
attributes.
Unlike relational databases, semi-structured data does not use fixed rows and
columns.
Example: JSON data includes both field names and values in the same file.
Example: In JSON, one record might have an "email" field while another
might not.
Web pages (HTML, XML) – Contain structured tags but store unstructured content.
Social Media Data – Twitter messages with metadata (e.g., hashtags, mentions).
The most common formats of semi-structured data are XML and JSON:
XML (Extensible Markup Language) – Uses tags to define elements. Used in web services
(SOAP) and configuration files.
JSON (JavaScript Object Notation) – Stores data in key-value pairs. Used in Web APIs
(REST) and NoSQL databases.
<Book>
<Author>Seema Acharya</Author>
<Publisher>Wiley India</Publisher>
<Year>2011</Year>
</Book>
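The XML book record above is self-describing: each field is located by its tag name rather than by a fixed column position. A quick sketch of parsing it with Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

xml_doc = """<Book>
  <Author>Seema Acharya</Author>
  <Publisher>Wiley India</Publisher>
  <Year>2011</Year>
</Book>"""

book = ET.fromstring(xml_doc)
# Tags act as the schema: fields are found by name, not by column position
print(book.find("Author").text)  # Seema Acharya
print(book.find("Year").text)    # '2011' -- returned as text; XML does not enforce types
```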
The same record represented in JSON:
{
"Book": {
"Author": "Seema Acharya",
"Publisher": "Wiley India",
"YearOfPublication": 2011
}
}
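A JSON record equivalent to the XML book example can be parsed with Python's json module; the nesting below mirrors the XML structure and is an illustrative reconstruction:

```python
import json

json_doc = """{
  "Book": {
    "Author": "Seema Acharya",
    "Publisher": "Wiley India",
    "YearOfPublication": 2011
  }
}"""

record = json.loads(json_doc)
print(record["Book"]["Author"])             # Seema Acharya
print(record["Book"]["YearOfPublication"])  # 2011 -- parsed as an int, unlike XML text
# Another record may omit fields entirely; .get() tolerates the missing key
other = json.loads('{"Book": {"Author": "Unknown"}}')
print(other["Book"].get("YearOfPublication"))  # None
```

This shows why semi-structured records need not share an identical schema: one record carries a field that another lacks.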
Relationships: Structured data defines them using Primary and Foreign Keys; semi-structured
data uses nested structures.
Examples: Structured – employee databases, bank transactions; semi-structured – emails,
JSON APIs, HTML pages.
1. More Flexible than Structured Data – Schema can evolve over time.
5. Widely Used in Web Applications – JSON and XML are standard for APIs.
E-commerce Product Listings – Product details are stored in flexible JSON files.
NoSQL Databases (MongoDB, CouchDB) – Use JSON for fast data retrieval.
1. Less Efficient than Structured Data – Searching through nested elements is slower.
2. Complex Querying – Requires special tools like XPath, JSONPath.
3. Inconsistent Data Format – Records may not always follow the same structure.
Solution:
Use Big Data frameworks like Hadoop for large-scale XML processing.
Unstructured Data
Unstructured data does not conform to a pre-defined data model. It includes various types of
text and other content formats with unpredictable structures. This type of data constitutes a
significant portion of enterprise data and presents unique challenges in terms of processing
and analysis.
Example: Text files contain metadata (e.g., file name, creation date), but they are
classified as unstructured because analysis focuses primarily on their content rather
than their properties.
1. Data Mining
A set of techniques used to identify patterns and relationships in large datasets using artificial
intelligence, machine learning, statistics, and database systems. Key techniques include
classification, clustering, regression, and association rule mining.
2. Text Analytics and Natural Language Processing (NLP)
Extracts meaningful insights from text data using statistical pattern learning.
Used in applications like chatbots, voice assistants, machine translation, and text
summarization.
Common sources include chats, blogs, wikis, emails, and text messages.
These techniques are used for processing text and other unstructured data to extract hidden
meanings and relationships.
Data has three key characteristics: composition, condition, and context, which define its
structure, usability, and relevance.
Composition of Data
Composition refers to the structure, sources, and nature of data. It can be structured, semi-
structured, or unstructured, originating from databases, sensors, social media, or enterprise
systems. Data granularity varies from aggregated reports to detailed transaction logs. It can
be static (unchanging historical records) or real-time streaming (continuously generated from
IoT devices or social media).
Condition of Data
The condition of data determines its quality and readiness for analysis. Raw data may contain
errors, missing values, or duplicates, requiring cleansing and enrichment before use. High-
quality data ensures accurate insights, whereas poor-quality data can lead to misleading
conclusions.
Context of Data
Context provides insight into where, why, and how data was generated. It answers key
questions about its source, purpose, and sensitivity. For example, customer transaction data
differs in significance from medical records, with varying levels of privacy and security
concerns. Understanding context helps ensure ethical and accurate data interpretation.
Small data is well-structured, with known sources and minimal complexity, making it easy to
analyze. Big data, on the other hand, involves high volume, velocity, and variety, often from
multiple unknown sources. It requires advanced techniques such as machine learning and
distributed computing to process effectively.
Big data is high-volume, high-velocity, and high-variety information that requires cost-
effective, innovative processing methods to derive valuable insights for decision-making.
The variety of data includes text, images, videos, social media posts, and
sensor data.
The 3Vs concept (Volume, Velocity, and Variety) was introduced by Doug Laney in 2001
and is widely used to define big data challenges and opportunities.
Despite its potential, big data presents several challenges that organizations must address.
The majority of today's data has been generated in the last few years, and its growth
is accelerating.
Cloud computing offers cost efficiency, scalability, and flexibility for big data
storage and processing.
However, businesses must decide whether to store data on-premises or in the cloud,
considering security and compliance concerns.
Some data holds long-term value, while other data becomes obsolete within hours.
There is a high demand for data science professionals who can manage, analyze,
and interpret big data.
However, there is a shortage of data visualization experts who can present complex
data in an understandable way.
Big data refers to large, complex datasets that are characterized by high volume, high
velocity, and high variety. These three attributes define how data is generated, processed, and
utilized in modern data-driven environments.
2.5.1 Volume
The volume of data has expanded exponentially, from kilobytes (KB) and megabytes (MB) to
petabytes (PB), exabytes (EB), and beyond. The sheer scale of data today requires advanced
storage and processing systems to manage it effectively.
Big data originates from a variety of sources, both internal and external to an organization:
Data Storage: Traditional file systems, SQL databases (Oracle, MySQL, PostgreSQL), and
NoSQL databases (MongoDB, Cassandra).
Archives: Scanned documents, customer records, health records, student data, and
organizational reports.
Public Web: Wikipedia, government census data, weather reports, compliance records.
Sensor Data: Data from IoT devices, smart meters, car sensors, industrial equipment.
Machine Log Data: Event logs, application logs, business process logs, audit logs, and
clickstream data (user activity on websites).
2.5.2 Velocity
Velocity refers to the speed at which data is generated and processed. Data handling has
evolved through several stages:
Batch Processing: Data is collected and processed at scheduled intervals (e.g., payroll
processing).
Periodic Processing: Data updates occur at fixed timeframes (e.g., bank transactions at the
end of a business day).
Near Real-Time Processing: Data is processed within seconds or minutes of generation (e.g.,
stock market updates).
Real-Time Processing: Data is processed as soon as it is generated, enabling immediate
responses (e.g., fraud detection).
2.5.3 Variety
Big data includes a wide range of data types, classified into three main categories:
Structured Data: Highly organized and stored in relational databases (e.g., financial
transactions, customer records).
Semi-Structured Data: Data with some structure but not as rigid as relational databases (e.g.,
XML, JSON, HTML).
Unstructured Data: Data without a predefined structure, making it harder to store and analyze
(e.g., emails, videos, images, social media posts, PDFs).
In addition to Volume, Velocity, and Variety, data exhibits other important characteristics:
1. Veracity and Validity
Veracity refers to the accuracy and reliability of data; not all collected data is meaningful or
relevant for analysis.
Validity ensures that data is correct, clean, and suitable for decision-making.
2. Volatility
Some data remains relevant for long periods (e.g., customer purchase history).
Other data becomes obsolete quickly (e.g., real-time social media trends).
Organizations must define how long to retain data before it loses value.
3. Variability
Variability refers to inconsistencies in the speed and flow of data, which can peak and dip
unpredictably.
Example: Retail businesses experience high traffic during festive sales, followed by a slump
in demand.
The more data available, the better the accuracy of analytical insights. Organizations leverage
big data for:
1. Improved Decision-Making
Data-driven insights enable faster and more accurate business decisions.
2. Operational Efficiency
Big data analytics helps optimize processes, reduce waste, and enhance efficiency.
Example: Manufacturing industries use big data to minimize downtime through predictive
maintenance.
Example: Retailers use AI-driven supply chain optimization to reduce excess inventory.
3. Innovation
Insights from big data fuel the development of new products and services.
Example: Streaming services (Netflix, Spotify) use data to recommend personalized content.
4. Enhanced Customer Experience
Big data helps analyze customer behavior and deliver personalized experiences.
Key Differences
Big Data: Data is stored in a distributed file system (e.g., Hadoop Distributed File System -
HDFS). It scales horizontally (adding more machines), whereas traditional BI scales
vertically (adding more power to an existing machine).
Traditional BI: Works with structured data and moves data to processing functions (move
data to code).
Big Data: Works with structured, semi-structured, and unstructured data and moves
processing functions to data (move code to data).
Decision-Making Tools
SQL queries, dashboards, and data mining tools are used for business intelligence and
analytics.
Diverse Data Sources: Web logs, social media, documents (PDFs, text files),
multimedia (audio, video).
Data Location: Data comes not only from within the company but also from external
sources (e.g., social media, IoT devices).
Storage: Hadoop Distributed File System (HDFS) for large-scale data storage.
Processing: Uses MapReduce for distributed data processing.
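MapReduce in Hadoop runs distributed across a cluster; as a rough single-machine sketch of the model it uses (map emits key-value pairs, a shuffle groups them by key, reduce aggregates each group), here is a word count, the classic MapReduce example:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this across machines)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; here, sum the counts per word
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big storage", "data moves code to data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

This illustrates "move code to data": in a real cluster each machine runs map_phase locally on its own block of the file, and only the small (word, count) pairs travel over the network.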