Parallel Database Systems and Their Architecture

Uploaded by

vinashreemeshram
Copyright
© All Rights Reserved
Parallel Database Systems and Their Architecture

• A parallel database system is designed to distribute data and workload
across multiple processors or nodes in order to improve performance,
scalability, and fault tolerance. The goal is to handle large datasets and
complex queries efficiently by executing operations in parallel across
multiple computing resources.
Types of Parallelism in Database Systems

• Parallelism in database systems can be categorized into two main types:
• Data Parallelism: Data is split across multiple processors, with each
processor handling a portion of the data.
• Task Parallelism: Different tasks (queries or operations) are executed in
parallel, allowing multiple queries to run simultaneously.
Key Concepts
• Shared-Nothing Architecture: Each node in the system has its own CPU,
memory, and disk storage. No resources are shared, and nodes communicate
with each other only over a network. This is generally the most scalable of
the three approaches, because adding a node adds no contention for a shared
resource.
• Shared-Disk Architecture: All nodes have access to a common disk storage
system, but each node still has its own processors and memory. Data does not
need to be partitioned or replicated across nodes, because every node can
reach all of it. Scalability is limited by contention for the shared storage
and by the coordination (locking and cache coherence) required when several
nodes update the same data.
• Shared-Memory Architecture: All processors share a single physical memory,
and usually the disks as well. Communication between processors is fast
because it goes through memory, and the system can be simpler to implement.
The shared memory bus becomes a bottleneck as processors are added, so this
architecture is less scalable than shared-nothing.
Parallel Database Architectures
• Parallel databases can be implemented using various architectures,
depending on the level of parallelism and the specific requirements of the
application.
• Parallel Database Architectures Based on Data Partitioning:
• Horizontal Partitioning: The rows of a table are divided among nodes.
Range partitioning is one common strategy: rows are assigned to nodes by
ranges of a key value. Hash and round-robin partitioning are others. For
example, a table containing customer data can be range-partitioned by
customer ID, with each range handled by a different node.
• Vertical Partitioning: The columns of a table are distributed across
different nodes. This is useful when only a subset of columns is frequently
accessed for specific queries.
• Hybrid Partitioning: A combination of horizontal and vertical
partitioning, allowing for more flexible and optimized data distribution.
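The partitioning schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-node cluster, the split points 1000 and 2000, and all function names are hypothetical, and hash and round-robin placement are included as common alternatives to ranges.

```python
from bisect import bisect_right
import zlib

NUM_NODES = 3  # hypothetical cluster size


def range_partition(customer_id, boundaries=(1000, 2000)):
    """Node index under range partitioning.

    With the assumed split points, ids below 1000 go to node 0,
    ids 1000..1999 to node 1, and ids from 2000 up to node 2.
    """
    return bisect_right(boundaries, customer_id)


def hash_partition(customer_id, num_nodes=NUM_NODES):
    """Node index under hash partitioning.

    Uses CRC32 rather than the built-in hash() so the result is the
    same in every process.
    """
    return zlib.crc32(str(customer_id).encode()) % num_nodes


def round_robin_partition(row_number, num_nodes=NUM_NODES):
    """Node index under round-robin placement (row i goes to node i mod n)."""
    return row_number % num_nodes


def vertical_partition(row, column_groups):
    """Split one row (a dict) into column fragments.

    Each group should repeat the key column so that fragments can be
    joined back together.
    """
    return [{col: row[col] for col in group} for group in column_groups]
```

Range partitioning keeps neighbouring keys on the same node, which suits range queries but can overload a node if the keys are skewed. Hash and round-robin placement spread rows more evenly at the cost of scattering ranges.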
Distributed Database
• A distributed database is a collection of data
that is spread across multiple locations, which
could be on different computers, servers, or
geographical regions. Despite its distribution,
the system appears to users as a single unified
database.
Key Features of Distributed Databases:

• Data Distribution:
– Data is stored across multiple physical locations.
– The distribution can be homogeneous (same software and
schema) or heterogeneous (different software or schema).
• Transparency:
– Location Transparency: Users don’t need to know where
data is stored.
– Replication Transparency: Users are unaware of data
replication across nodes.
– Fragmentation Transparency: Users don’t need to worry
about how data is divided across sites.
• Scalability:
– Can handle growing data and user load by adding more nodes to the system.
• Reliability and Availability:
– If one site fails, the system can continue operating using other sites.
– Replication keeps data available as long as at least one site holding a
copy is reachable.
• Concurrency Control:
– Ensures consistency when multiple users access the database
simultaneously.
• Fault Tolerance:
– Uses mechanisms like replication and recovery protocols to handle node or
system failures.
Types of Distributed Databases:

• Homogeneous Distributed Databases:
– All sites use the same DBMS.
– Simplifies system design but offers less flexibility.
• Heterogeneous Distributed Databases:
– Sites may use different DBMS products and schemas.
– Requires middleware to manage differences.
• Distributed Data Storage (techniques used by either type):
– Fragmentation: Dividing data into smaller pieces (fragments).
– Replication: Storing copies of data on multiple sites.
– Combination: Fragmenting the data and replicating some or all fragments.
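As a rough illustration of combining fragmentation with replication, the sketch below places each fragment on several sites and checks which fragments remain reachable after a site failure. The function names, the site labels, and the round-robin placement rule are assumptions made for this example, not something the slides prescribe.

```python
def place_fragments(fragments, sites, replication_factor=2):
    """Assign each fragment to `replication_factor` distinct sites.

    Copies go to consecutive sites in round-robin order, so two copies
    of the same fragment never land on the same site.
    """
    if replication_factor > len(sites):
        raise ValueError("need at least as many sites as replicas")
    placement = {}
    for i, fragment in enumerate(fragments):
        placement[fragment] = [
            sites[(i + r) % len(sites)] for r in range(replication_factor)
        ]
    return placement


def surviving_fragments(placement, failed_site):
    """Fragments that still have at least one copy on a working site."""
    return {
        fragment
        for fragment, holders in placement.items()
        if any(site != failed_site for site in holders)
    }
```

With three fragments on sites A, B, and C and two copies each, losing any single site leaves every fragment reachable. With one copy each, losing a site loses the fragment stored there.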
Advantages:

• Improved reliability and availability.
• Faster query responses when data is stored close to the users who query it.
• Scalability for larger data volumes.
• Supports collaboration across geographically dispersed teams.
Distributed Database Architecture
• Distributed database architecture refers to how data is distributed and
managed across multiple sites or nodes. The architecture determines how the
distributed database appears to users, how the data is stored, and how
operations like querying and transactions are handled across different
locations.
Parallel Query Processing

• Query execution in parallel databases is optimized using multiple
techniques:
• Data Parallelism: A large query is broken down into smaller subqueries,
each executed on a different processor against its own partition of the
data. For instance, an aggregate function like SUM or COUNT can be computed
per partition in parallel, and the partial results are then combined.
• Pipeline Parallelism: A query is broken down into stages that run at the
same time on different processors, with each stage consuming the output of
the previous one as it is produced. For example, one processor might perform
a SELECT (filter), while another performs a JOIN on the filtered rows, and
another computes an aggregate over the join output.
• Operator Parallelism: A single query operator such as JOIN, GROUP BY, or
SORT is itself executed in parallel across multiple processors or nodes.
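The data-parallel aggregate described above can be sketched as follows. This is a minimal example under stated assumptions: threads from Python's standard `concurrent.futures` module stand in for processors or nodes, the sample partitions are made up, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(partition):
    """Local SUM and COUNT over one partition of the data."""
    return sum(partition), len(partition)


def parallel_sum_count(partitions, max_workers=4):
    """Compute one partial result per partition, then combine them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


# Made-up data: three partitions of unequal size.
partitions = [[10, 20, 30], [5, 15], [40]]
total, count = parallel_sum_count(partitions)
average = total / count  # AVG is derived from the global SUM and COUNT
```

Note that AVG is computed from the combined SUM and COUNT. Averaging the per-partition averages would give the wrong answer when partitions differ in size.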
Load Balancing and Query Scheduling

• Efficient load balancing is crucial for parallel databases to ensure that
all processors are working at optimal capacity. The system may use:
• Workload Distribution: Queries are divided into smaller tasks, and these
tasks are distributed across available processors in such a way that each
processor performs roughly the same amount of work.
• Dynamic Scheduling: The database management system may dynamically adjust
how tasks are assigned to nodes based on the current workload or resource
availability.
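One simple way to approximate an even workload distribution is a greedy rule: take tasks from largest to smallest and give each one to the processor with the least work so far. The sketch below assumes that task costs are known in advance, which a real scheduler would have to estimate, and all names in it are hypothetical.

```python
import heapq


def assign_tasks(task_costs, num_processors):
    """Greedy assignment: largest task first, to the least-loaded processor.

    task_costs maps a task name to its estimated cost. Returns the
    task list per processor and the resulting load per processor.
    """
    # Min-heap of (current load, processor id).
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}

    by_cost = sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True)
    for task, cost in by_cost:
        load, p = heapq.heappop(heap)  # least-loaded processor so far
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))

    loads = {
        p: sum(task_costs[t] for t in tasks) for p, tasks in assignment.items()
    }
    return assignment, loads
```

The result is not guaranteed to be optimal, but the loads usually end up close to each other. A dynamic scheduler would apply the same kind of rule repeatedly as tasks arrive and finish, instead of once up front.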
Examples of Parallel Database Systems

• Google Bigtable: A distributed storage system designed to manage large
amounts of data across many machines. Tables are split by row-key range into
tablets that are served by different machines, so different parts of a table
can be read and written in parallel.
• Amazon Redshift: A fully managed, petabyte-scale data warehouse solution
that uses columnar storage and parallel processing to enable fast query
performance on large datasets.
• Apache Hadoop and HBase: Hadoop’s distributed file system (HDFS) and HBase
provide a distributed, parallel environment for processing large datasets.
HBase is a NoSQL database, modeled on Bigtable, that runs on top of HDFS and
is designed for scalable, real-time data access.
• Teradata: Teradata uses a shared-nothing, massively parallel processing
(MPP) architecture in which each node has its own storage and queries are
computed in parallel across many nodes, optimizing performance for
large-scale data analytics.
Architecture Components of a Parallel Database System

• Processing Nodes:
– These are the CPU resources that execute the queries in
parallel. Depending on the architecture, each node may be
assigned a portion of data or a specific task.
• Storage System:
– Storage can follow a shared-nothing or shared-disk layout, or sit on a
distributed file system. In shared-nothing architectures, each node
typically has its own local disk, while in shared-disk systems all nodes
read and write a common set of disks.
• Interconnect Network:
– The network is crucial for communication between nodes. In
shared-nothing architectures, this network handles data
transfers for query results and intermediate computations.
• Query Processor:
– The query processor is responsible for parsing, optimizing, and
executing SQL queries. In parallel databases, it may also include a
parallel execution engine that breaks down queries into tasks that
can be distributed across the nodes.
• Transaction Manager:
– Manages transactions in a parallel database, ensuring that data
consistency and isolation properties are maintained even when data
is spread across multiple nodes.
• Scheduler:
– In parallel databases, the scheduler decides which tasks are
assigned to which processing nodes, and it manages the parallel
execution of queries and data processing.
Advantages of Parallel Database Systems
• Performance: Parallelism allows for the simultaneous
execution of queries and data processing, significantly
reducing query response times, especially for large
datasets.
• Scalability: Parallel systems can handle growing data
volumes by simply adding more processing nodes, which
increases both data storage and computational power.
• Fault Tolerance: When data is replicated across multiple nodes, the
system can tolerate the failure of individual nodes without losing data or
interrupting service.
• Improved Throughput: Multiple queries can be executed
concurrently, leading to higher system throughput,
especially in multi-user environments.
