
Advanced Databases
BY JACOB ZVIRIKUZHE
MARCH/APRIL 2025
Introduction to Advanced DB
Database:
A database is an organized collection of data, such as text, numbers, images, or videos, stored electronically and managed on a computer.
A database management system (DBMS) is the software that helps people store, find, and manage this information easily.
Importance of DB in modern
applications.
They support:
Data-driven decision making (business intelligence,
reporting).
Transaction processing (banking systems, online
shopping).
Scalable storage for massive amounts of data (social
media, IoT).
Relational vs. Non-relational
Databases
Relational Databases:
Structure: Data is stored in tables (rows and columns). Each table
has a unique key.
Scalability: Typically scales vertically (increase hardware capacity).
Use Cases: Best suited for structured data and applications
requiring complex queries.
Examples: MySQL, PostgreSQL.
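As a minimal sketch (table and column names are illustrative), a relational table declares its columns and a unique key:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,   -- unique key identifying each row
    LastName   VARCHAR(50),
    Country    VARCHAR(50)
);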
Relational vs. Non-relational
Databases
Non-relational Databases (NoSQL):
Structure: Data is stored in various formats such as documents, key-value pairs, or graphs.
Scalability: Typically scales horizontally (add more servers).
Use Cases: Ideal for unstructured or semi-structured data, high
scalability, and real-time processing (e.g., social media, big data
applications).
Examples: MongoDB (Document-based)
Review of Basic Concepts
Normalization:
Process of organizing data in a database to reduce redundancy and improve data integrity.
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Fourth Normal Form (4NF)
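As a brief illustration (table and column names are assumed, not taken from a specific system), moving repeated customer details out of an Orders table is the kind of step normalization prescribes:
-- Unnormalized: Orders(OrderID, CustomerName, Country, Product) repeats customer details on every row
-- Normalized: customer details live once in a Customers table; Orders keeps only the key
CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT,                -- points to the single copy of the customer's details
    Product    VARCHAR(100),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);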


Review of Basic Concepts
ACID Properties:
Atomicity: Ensures that all operations within a transaction are completed
successfully or none at all.
Consistency: Guarantees that a transaction brings the database from one valid
state to another.
Isolation: Ensures that transactions are executed in isolation from one another.
Durability: Guarantees that once a transaction is committed, it remains
permanent, even in the case of a system failure.
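A minimal sketch, assuming a hypothetical Accounts table: wrapping both updates in a single transaction means either both succeed or neither does (atomicity), and COMMIT makes the change permanent (durability).
START TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
COMMIT;       -- both changes become permanent together
-- ROLLBACK; would instead undo both changes if an error occurred before COMMIT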
Advanced Database Models
Object-Oriented Databases:
Stores data as objects, similar to how data is structured in object-
oriented programming.
Supports inheritance, polymorphism, and encapsulation.
Advanced Database Models
Document-based Databases:
Stores data in document formats such as JSON, BSON, or XML.
Flexible schema and suitable for handling large, unstructured data
sets.
Commonly used in content management systems, e-commerce
platforms (Example: MongoDB).
Advanced Database Models
Graph Databases:
Stores data in the form of nodes, edges, and properties to
represent entities and relationships between them.
Highly efficient in handling data with complex relationships, such
as social networks, fraud detection, and recommendation engines
(Example: Neo4j).
Advanced SQL and Query
Optimization
Structured Query Language
SQL Queries:
SQL is the standard language for accessing and manipulating
databases.
SQL is used to retrieve, insert, update, and delete data from a
relational database.
Example: SELECT * FROM Customers WHERE Country =
'Zimbabwe';
Structured Query Language
Subqueries:
A subquery is a query nested inside another query, often used to refine
results or work with intermediate data.
Subqueries are often used with SELECT, INSERT, UPDATE, or DELETE
commands.
Example: SELECT LastName FROM Customers WHERE CustomerID =
(SELECT CustomerID FROM Orders WHERE OrderID = 1);
This retrieves the name of the customer who placed Order 1 by nesting a query inside
another query.
Stored Procedures and
Triggers
Stored Procedures:
A stored procedure is a precompiled collection of SQL statements that are
stored in the database.
They allow for code reuse and performance optimization.
Example:
DELIMITER //
CREATE PROCEDURE GetCustomerOrders (IN p_CustomerID INT)
BEGIN
  SELECT * FROM Orders WHERE CustomerID = p_CustomerID;
END //
DELIMITER ;
The parameter is prefixed (p_CustomerID) so it is not mistaken for the CustomerID column inside the query.
Stored procedures are used to automate repetitive tasks, such as retrieving data, performing calculations, or updating records.
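Assuming the procedure above has been created, it is invoked with a specific customer ID:
CALL GetCustomerOrders(42);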
Stored Procedures and
Triggers
Triggers:
A trigger is an automatic action executed in response to specific
events on a table, such as INSERT, UPDATE, or DELETE.
Triggers help enforce data integrity and audit trails.
Example:
DELIMITER //
CREATE TRIGGER UpdateCustomerTimestamp
BEFORE UPDATE ON Customers
FOR EACH ROW
BEGIN
  SET NEW.UpdatedAt = CURRENT_TIMESTAMP;
END //
DELIMITER ;
Indexing
Importance of Indexing:
Indexes are used to speed up the retrieval of data from a table by providing
a fast lookup mechanism.
Without indexes, the database must perform a full table scan, examining
every row to find matching records.
Index Types:
 Clustered Index: Alters the physical order of the table and is typically
created on the primary key.
 Non-clustered Index: Does not change the physical order but creates a
logical order for quick access.
Indexing
CREATE INDEX idx_lastname ON Customers(LastName);
This creates an index on the LastName column, speeding up queries that filter by last name.
How Improper Indexing Affects Performance:
Over-indexing: Creating too many indexes can slow down INSERT, UPDATE, and DELETE operations because the database must also update the indexes.
Wrong Indexes: Using inappropriate indexes (e.g., indexing a column with low selectivity, such as Boolean values) does not improve performance and wastes resources.
Query Execution Plans
Understanding Query Execution Plans:
A query execution plan is a visual or textual representation of the steps the database takes to execute a query.
It shows how tables are accessed, whether indexes are used, and the cost associated with each operation.
Query Execution Plans
Key Elements:
Table Scans: Occur when no index is used, requiring the database to scan the entire table.
Index Scans: Used when an index exists but is not perfectly suited, so a portion of the index is scanned.
Index Seeks: The most efficient option, where the database can directly seek specific rows using the index.
Join Operations: Determine how tables are joined; Nested Loops, Merge Join, and Hash Join are common methods.
Query Execution Plans
Analyzing Performance:
Execution plans help identify performance
bottlenecks:
 Missing indexes can lead to table scans.
 Costly joins may indicate inefficient data retrieval.
 High-cost operations may need query optimization.
Query Execution Plans
Example of Viewing a Query Plan:
EXPLAIN SELECT * FROM Orders WHERE OrderID > 100;
Distributed Databases
and Transactions
Distributed Database Architectures
Homogeneous vs. Heterogeneous Distributed Databases:
Distributed databases are designed to store and manage
data across multiple servers or locations.
The key difference between homogeneous and
heterogeneous architectures lies in the uniformity of the
systems involved.
Distributed Database Architectures
Homogeneous Distributed Database:
All sites use the same DBMS (Database Management System).
Data is distributed across nodes with uniform software, making management
easier.
Example: A company with several branches running the same database
software (e.g., MySQL) in each branch.
Advantages:
 Easier administration due to the use of the same DBMS.
 Simpler data integration and querying.
Distributed Database Architectures
Heterogeneous Distributed Database:
Each site can use different DBMSs (e.g., some use MySQL, others use
Oracle).
Data may be stored in different formats, leading to complex
integration.
Example: A corporation that acquired different companies, each using
different database systems.
Advantages:
Allows for greater flexibility in accommodating diverse systems.
Enables the integration of legacy systems with newer databases.
Distributed Database Architectures
Data Distribution and Replication
Data Distribution:
Partitioning: The process of splitting a database into smaller, more
manageable parts that are distributed across multiple locations.
Horizontal Partitioning: Each partition contains a subset of rows.
Vertical Partitioning: Each partition contains a subset of
columns.
Example: In an e-commerce platform, customers from different
regions may have their data stored in different partitions.
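A minimal MySQL-flavoured sketch of horizontal partitioning by region (table, column, and region values are illustrative):
CREATE TABLE RegionalCustomers (
    CustomerID INT NOT NULL,
    Region     VARCHAR(20) NOT NULL
)
PARTITION BY LIST COLUMNS (Region) (
    PARTITION p_southern_africa VALUES IN ('Zimbabwe', 'Zambia', 'Botswana'),
    PARTITION p_europe          VALUES IN ('Germany', 'France')
);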
Distributed Database Architectures
Replication:
Data replication involves copying and maintaining the same data
across multiple sites.
Distributed Database Architectures
Types of Replication:
 Master-Slave Replication: Data is written to a single master
node and replicated to multiple read-only slave nodes.
 Multi-Master Replication: Allows updates to occur at any node,
and the changes are synchronized across all nodes.
Distributed Database Architectures
Importance of Replication:
 Fault Tolerance: If one node fails, the data is still available from
other nodes.
 Improved Performance: Distributing the load across multiple
nodes speeds up data access and query execution.
 Disaster Recovery: Replicated data ensures minimal downtime
during a system failure.
Distributed Database Architectures
Concurrency Control in Distributed Databases:
Concurrency control ensures that multiple transactions occurring at the same
time do not interfere with each other, maintaining data consistency.
In distributed databases, the complexity increases because transactions might
be happening across different physical locations.
Distributed Database Architectures
Challenges in Concurrent Access:
Lost updates
Dirty reads
Non-repeatable reads
Phantom reads
Deadlocks
Distributed Database Architectures
Key Techniques for Concurrency Control:
1.Locking Mechanisms:
(a) Two-Phase Locking (2PL): Transactions obtain locks on resources
during the first phase and release them in the second phase.
 Shared Lock (Read): Allows multiple transactions to read the
data.
 Exclusive Lock (Write): Only one transaction can modify the
data.
(b) Challenges: Deadlocks can occur when two or more transactions
wait for each other to release locks.
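A minimal sketch of how these locks appear in SQL (MySQL 8 / InnoDB syntax; table and values are illustrative). Locks are acquired as the transaction proceeds (growing phase) and released together at COMMIT (shrinking phase):
START TRANSACTION;
SELECT Balance FROM Accounts WHERE AccountID = 1 FOR SHARE;    -- shared (read) lock
SELECT Balance FROM Accounts WHERE AccountID = 2 FOR UPDATE;   -- exclusive (write) lock
UPDATE Accounts SET Balance = Balance - 50 WHERE AccountID = 2;
COMMIT;   -- all locks held by the transaction are released here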
Distributed Database Architectures
 Timestamp Ordering: Transactions are assigned a timestamp when they begin, and the system ensures that transactions are executed in order of their timestamps to prevent conflicts.
 Optimistic Concurrency Control: Transactions are executed without restrictions initially, and conflicts are checked only at commit time. If conflicts are detected, the transaction is rolled back.
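Optimistic control is often approximated at the application level with a version column; a minimal sketch (table, column, and values are illustrative):
UPDATE Products
SET Price = 19.99, Version = Version + 1
WHERE ProductID = 7 AND Version = 3;   -- Version 3 is what this transaction originally read
-- If zero rows are updated, another transaction changed the row in the meantime:
-- the application treats this as a conflict and retries or rolls back.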
Distributed Database Architectures
Two-Phase Commit Protocol
The Two-Phase Commit Protocol (2PC) is used to ensure
atomicity (all-or-nothing) across a distributed system during a
transaction.
It guarantees that either all nodes agree to commit or all
agree to roll back the transaction.
Distributed Database Architectures
Phase 1: Prepare Phase:
The coordinator node sends a message to all participating
nodes, asking them to prepare for the transaction by writing
all changes to a log but not committing.
Each node responds with either a "Ready to Commit" or
"Abort" message.
Distributed Database Architectures
Phase 2: Commit/Abort Phase:
If all nodes respond with "Ready to Commit", the
coordinator sends a commit command, and each node
commits the transaction.
If any node responds with "Abort", the coordinator sends
an abort message to all nodes, and they roll back their
changes.
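MySQL exposes the two phases through its XA statements; a minimal sketch of what runs on one participating node (the transaction identifier and statement are illustrative):
XA START 'txn1';
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
XA END 'txn1';
XA PREPARE 'txn1';    -- Phase 1: this node votes "Ready to Commit"
XA COMMIT 'txn1';     -- Phase 2: applied once the coordinator decides to commit
-- XA ROLLBACK 'txn1' would be issued instead if any participant voted "Abort"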
NoSQL Databases
NoSQL Databases
 NoSQL (Not Only SQL) databases are designed to handle large volumes of unstructured or semi-structured data.
 They offer scalability and high performance.
 These databases are increasingly popular for applications requiring real-time processing, big data storage, and horizontal scaling.
NoSQL Databases
Types of NoSQL Databases:
 Key-Value Databases
 Document Databases
 Column-Family Databases
 Graph Databases
Investigate the use cases of each of these.
NoSQL Databases
Use Cases for NoSQL Databases
1.Handling Unstructured or Semi-structured Data:
 NoSQL is ideal for applications that need to store
unstructured (e.g., text, images) or semi-structured data
(e.g., JSON documents).
 Example: A social media platform that stores user posts,
images, and multimedia.
NoSQL Databases
Use Cases for NoSQL Databases
2. Applications Requiring High Scalability:
NoSQL databases are designed to scale horizontally across
distributed systems, making them perfect for applications that need
to grow quickly.
Example: An e-commerce site that experiences large traffic spikes
during sales or holiday seasons.
NoSQL Databases
Use Cases for NoSQL Databases
3. Real-time Big Data Processing:
NoSQL databases handle high-throughput, real-time data
streams effectively.
Example: Internet of Things (IoT) devices generating real-
time sensor data that needs to be stored and analyzed
quickly.
NoSQL Databases
Use Cases for NoSQL Databases
4. High Availability and Fault Tolerance:
NoSQL databases are distributed, offering high availability and
automatic failover in case of node failures.
Example: A global news website that needs to remain operational
even if some of its servers go offline.
NoSQL Databases
Use Cases for NoSQL Databases
5. Graph-based Applications:
NoSQL graph databases efficiently store and query relationships
between entities, which is critical for social networks, logistics, and
fraud detection.
Example: A social network platform that tracks connections and
relationships between users.
NoSQL Databases
Scalability and Performance Considerations
Horizontal Scalability in NoSQL:
NoSQL databases scale horizontally by adding more servers
(nodes) to the database cluster. This contrasts with relational
databases, which typically scale vertically (by increasing
server capacity).
Horizontal scaling allows NoSQL to handle large amounts of
traffic and data without performance degradation.
NoSQL Databases
Scalability and Performance Considerations
Horizontal Scalability in NoSQL
Key Techniques for Horizontal Scaling
Sharding:
The process of breaking a large dataset into smaller, manageable
pieces (shards) and distributing them across multiple servers.
Example: MongoDB shards a collection of documents across
several servers to ensure fast access.
NoSQL Databases
Scalability and Performance Considerations
Horizontal Scalability in NoSQL
Key Techniques for Horizontal Scaling
Replication:
NoSQL databases replicate data across multiple nodes to ensure high
availability and fault tolerance. If one node fails, another node takes over,
minimizing downtime.
Example: Apache Cassandra automatically replicates data across multiple data
centers.
NoSQL Databases
Scalability and Performance Considerations
Horizontal Scalability in NoSQL
Key Techniques for Horizontal Scaling
Eventual Consistency:
Some NoSQL databases prioritize availability and partition tolerance
(from the CAP theorem) over immediate consistency. This means that
after an update, data will eventually become consistent across all nodes.
Example: Amazon DynamoDB provides high availability by allowing
temporary inconsistencies in favor of fast reads/writes.
Data Warehousing and OLAP
(Online Analytical
Processing)
Concepts of Data Warehousing
Data Warehousing:
A data warehouse is a centralized repository that
allows for the storage, analysis, and reporting of large
amounts of historical data from different sources.
It supports business intelligence activities by
providing a foundation for decision-making and
strategic planning.
Concepts of Data Warehousing
Data Marts:
A data mart is a smaller, specialized version of a data
warehouse, focused on a specific business area (e.g., sales,
finance, or marketing).
It allows users to access relevant data without overwhelming
them with the entire data warehouse's contents.
Advantages: Faster access to specific data, reduced complexity,
and improved performance for departmental queries.
Concepts of Data Warehousing
ETL Processes:
ETL (Extract, Transform, Load) is the process of moving data from
source systems into the data warehouse.
Extract: Gathering data from various sources (e.g., databases, flat
files, APIs).
Transform: Cleaning, validating, and converting data into a suitable
format for analysis.
Load: Inserting the transformed data into the data warehouse.
ETL processes ensure that the data is accurate, consistent, and up-
to-date for analysis.
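A minimal SQL sketch of the Load step, copying cleaned rows from a hypothetical staging table into a warehouse fact table (all table and column names are illustrative):
INSERT INTO FactSales (CustomerID, ProductID, SaleDate, Amount)
SELECT s.CustomerID,
       s.ProductID,
       DATE(s.SoldAt),        -- Transform: reduce the timestamp to a date
       ROUND(s.Amount, 2)     -- Transform: standardise the amount's precision
FROM StagingRawSales s        -- Extract: rows previously pulled from source systems
WHERE s.Amount IS NOT NULL;   -- Transform: skip rows that fail validation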
Concepts of Data Warehousing
OLAP vs. OLTP
OLAP (Online Analytical Processing)
Purpose: Supports complex queries and data analysis. It is optimized for read-heavy operations, allowing users to explore and analyze multidimensional data.
Characteristics:
• Aggregates large volumes of historical data.
• Enables users to perform complex calculations and generate
reports quickly.
• Suitable for decision support systems (DSS) and business
intelligence.
Concepts of Data Warehousing
OLTP (Online Transaction Processing):
Purpose: Manages day-to-day transactional data and supports
operational tasks. It is optimized for fast query processing and
maintaining data integrity in real-time.
Characteristics:
 Handles a large number of short online transactions (e.g., order
entry, financial transactions).
 Data is typically highly normalized to reduce redundancy and
ensure consistency.
 Suitable for applications requiring high availability and quick
response times (e.g., e-commerce platforms, banking systems).
Concepts of Data Warehousing
Differences:
Data Structure: OLAP uses denormalized data structures (e.g., star and
snowflake schemas) for fast querying, while OLTP uses normalized structures to
maintain data integrity.
Query Complexity: OLAP queries are complex and involve aggregations, while
OLTP queries are simple and transactional.
User Base: OLAP is used by business analysts and executives for decision-
making, while OLTP is used by operational staff for daily transactions.
Concepts of Data Warehousing
OLAP Cubes:
An OLAP cube is a multidimensional database that allows
users to view and analyze data across multiple dimensions
(e.g., time, geography, product).
Each cube consists of dimensions (attributes) and measures
(metrics) that define the data structure.
Concepts of Data Warehousing
Supporting Multidimensional Analysis:
Dimensions: Represent different perspectives of the data
(e.g., time, location, product category).
Measures: Numerical values that users want to analyze
(e.g., sales revenue, profit margins).
Hierarchies: Within each dimension, hierarchies allow users
to drill down (e.g., from year to month) for more detailed
analysis.
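A minimal sketch of such multidimensional analysis in SQL over a hypothetical star schema; WITH ROLLUP adds the subtotal rows used when rolling up from month to year:
SELECT d.Year, d.Month, p.Category, SUM(f.Revenue) AS TotalRevenue
FROM FactSales f
JOIN DimDate    d ON f.DateKey    = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.Year, d.Month, p.Category WITH ROLLUP;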
Big Data Technologies
Introduction to Big Data
Defining Big Data:
Big data refers to extremely large datasets that cannot be easily
managed, processed, or analyzed using traditional database systems
and tools.
Key Characteristics (The 3 Vs):
Volume: The sheer size of the data, ranging from terabytes to petabytes.
Velocity: The speed at which data is generated, collected, and processed,
requiring real-time analysis.
Variety: The different types of data generated, including structured, semi-
structured, and unstructured data from various sources (e.g., social media,
sensors, transactions).
Introduction to Big Data
Hadoop Ecosystem:
Apache Hadoop is an open-source framework designed for
distributed storage and processing of big data using
commodity hardware.
Introduction to Big Data
Key Components of Hadoop:
1.HDFS (Hadoop Distributed File System):
A distributed file system that stores data across multiple nodes, providing
high throughput and fault tolerance. Data is split into blocks and replicated
across different nodes to ensure reliability.
2.MapReduce:
A programming model and processing engine that allows for parallel
processing of large datasets across a Hadoop cluster. It consists of two
phases:
 Map Phase: Processes input data and produces key-value pairs.
 Reduce Phase: Aggregates and reduces the key-value pairs to generate
the final output.
Introduction to Big Data
Other Components:
YARN (Yet Another Resource Negotiator): Manages and schedules
resources in the Hadoop cluster.
Hive: A data warehousing tool that allows users to query data using
SQL-like language.
Pig: A platform for processing large datasets with a high-level
scripting language.
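For example, a minimal HiveQL sketch (the table is illustrative); Hive compiles the SQL-like query into distributed jobs (classically MapReduce) over data stored in HDFS:
SELECT Country, COUNT(*) AS TotalOrders
FROM warehouse_orders
GROUP BY Country;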
Database Security and Privacy
Fundamentals of Database Security
Database Security:
The practice of protecting databases against unauthorized access,
misuse, and data breaches.
Essential for maintaining the integrity, confidentiality, and
availability of data.
User Authentication:
The process of verifying the identity of users attempting to access
the database.
Common methods include usernames and passwords, multi-factor
authentication (MFA), and biometric verification.
Fundamentals of Database Security
User Authorization:
Once authenticated, users are granted permissions to access
specific data or perform certain actions based on their roles.
Role-Based Access Control (RBAC) allows administrators to
define roles and assign permissions accordingly.
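A minimal MySQL 8 sketch of role-based access control (role, user, and schema names are illustrative):
CREATE ROLE 'sales_read';
GRANT SELECT ON salesdb.* TO 'sales_read';
CREATE USER 'report_user'@'%' IDENTIFIED BY 'Str0ng!Passw0rd';
GRANT 'sales_read' TO 'report_user'@'%';
SET DEFAULT ROLE 'sales_read' TO 'report_user'@'%';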
Access Control and Encryption
Access Control Techniques:
Mechanisms that restrict access to database resources based on user roles
and permissions.
Types of access control:
 Discretionary Access Control (DAC): Users can grant access to their own
resources.
 Mandatory Access Control (MAC): Access is controlled by a central
authority based on classification levels.
 Role-Based Access Control (RBAC): Access permissions are assigned based
on user roles within the organization
Access Control and Encryption
Encryption:
The process of converting plaintext data into ciphertext to prevent
unauthorized access.
Types of Encryption:
 Data-at-rest encryption: Protects stored data (e.g., using AES
encryption).
 Data-in-transit encryption: Secures data transmitted over
networks (e.g., using TLS/SSL protocols).
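A minimal MySQL sketch of column-level encryption with AES_ENCRYPT/AES_DECRYPT (table, column, and key handling are illustrative; production systems normally use transparent data-at-rest encryption and keep keys outside the database):
SET @key = 'example-key-material';              -- illustrative only; never hard-code keys
INSERT INTO Customers (LastName, NationalID)    -- NationalID is assumed to be a VARBINARY column
VALUES ('Moyo', AES_ENCRYPT('63-123456A78', @key));
SELECT LastName,
       CAST(AES_DECRYPT(NationalID, @key) AS CHAR) AS NationalID
FROM Customers;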
Compliance with Data Protection
Regulations
GDPR (General Data Protection Regulation):
Scope: Applies to all organizations processing personal data of individuals within the European Union (EU), regardless of where the organization is based.
Key Principles:
• Lawfulness, Fairness, and Transparency: Data must be processed legally
and transparently.
• Data Minimization: Only collect the data that is absolutely necessary for
the intended purpose.
• Accuracy: Data must be kept accurate and up to date.
• Storage Limitation: Data should not be retained longer than necessary.
• Integrity and Confidentiality: Ensure appropriate security to prevent
unauthorized access or breaches.
Compliance with Data Protection
Regulations
HIPAA (Health Insurance Portability and Accountability Act):
Scope: Applies to U.S. healthcare providers, health plans, and
healthcare clearinghouses, as well as their business associates.
Key Safeguards:
 Physical Safeguards: Protect physical access to electronic
systems and facilities.
 Administrative Safeguards: Implement policies and procedures
to manage the selection, development, and use of security
measures.
 Technical Safeguards: Ensure the confidentiality, integrity, and
availability of electronic Protected Health Information (ePHI).
The End
