Parallel Database Systems and Their Architecture
• Data Distribution:
– Data is stored across multiple physical locations.
– The distribution can be homogeneous (same software and
schema) or heterogeneous (different software or schema).
• Transparency:
– Location Transparency: Users don’t need to know where
data is stored.
– Replication Transparency: Users are unaware of data
replication across nodes.
– Fragmentation Transparency: Users don’t need to worry
about how data is divided across sites.
• Scalability:
– The system can handle growing data volumes and user
load by adding more nodes.
• Reliability and Availability:
– If one site fails, the system can continue operating
using other sites.
– Replication keeps data available when a site holding
one copy fails.
• Concurrency Control:
– Ensures consistency when multiple users access
the database simultaneously.
• Fault Tolerance:
– Uses mechanisms like replication and recovery
protocols to handle node or system failures.
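The fragmentation, replication, and location transparency points above can be illustrated with a small sketch. This is a toy in-memory model, not how any particular product works: the class `TinyCluster`, its methods, the site names, and the choice of CRC32 hashing with ring-order replicas are all assumptions made for illustration.

```python
import zlib


class TinyCluster:
    """Hypothetical in-memory stand-in for a set of database sites."""

    def __init__(self, site_names, replicas=2):
        self.sites = {name: {} for name in site_names}  # site -> local key/value store
        self.order = list(site_names)
        self.replicas = replicas  # assumed to be <= number of sites
        self.down = set()  # sites currently marked as failed

    def _owners(self, key):
        # Hash fragmentation: a stable hash picks the primary site,
        # and the next sites in ring order hold the replicas.
        start = zlib.crc32(key.encode("utf-8")) % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every replica site that is currently up.
        for site in self._owners(key):
            if site not in self.down:
                self.sites[site][key] = value

    def get(self, key):
        # The caller never names a site (location transparency);
        # the first live replica that holds the key answers.
        for site in self._owners(key):
            if site not in self.down and key in self.sites[site]:
                return self.sites[site][key]
        raise KeyError(key)

    def fail(self, site):
        # Simulate a site failure.
        self.down.add(site)


cluster = TinyCluster(["site_a", "site_b", "site_c"], replicas=2)
cluster.put("customer:42", {"name": "Ada"})
primary = cluster._owners("customer:42")[0]
cluster.fail(primary)              # take the primary site down
print(cluster.get("customer:42"))  # still answered by the surviving replica
```

The caller only ever uses `put` and `get` with a key, so where the data lives, how it is fragmented, and how many copies exist stay hidden. A real system would also need recovery and concurrency control, which this sketch leaves out.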
Components of a Parallel Database System:
• Processing Nodes:
– These are the CPU resources that execute the queries in
parallel. Depending on the architecture, each node may be
assigned a portion of data or a specific task.
• Storage System:
– Storage can follow a shared-nothing or shared-disk design,
or sit on a distributed file system. In shared-nothing
architectures, each node typically has its own local disk,
while in shared-disk systems, all nodes access a common pool
of disks.
• Interconnect Network:
– The network is crucial for communication between nodes. In
shared-nothing architectures, it carries intermediate results
and final query results between nodes.
• Query Processor:
– The query processor is responsible for parsing, optimizing, and
executing SQL queries. In parallel databases, it may also include a
parallel execution engine that breaks down queries into tasks that
can be distributed across the nodes.
• Transaction Manager:
– Manages transactions in a parallel database, ensuring that data
consistency and isolation properties are maintained even when data
is spread across multiple nodes.
• Scheduler:
– In parallel databases, the scheduler decides which tasks are
assigned to which processing nodes, and it manages the parallel
execution of queries and data processing.
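How the query processor, scheduler, and processing nodes fit together can be sketched with a partitioned aggregate. This is a minimal illustration under stated assumptions: the `FRAGMENTS` data, the node names, and the functions `local_sum`, `merge`, and `parallel_group_sum` are hypothetical, and a thread pool stands in for the scheduler and the nodes. In CPython, threads show the task structure rather than true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments: each list stands for the rows stored on one node.
FRAGMENTS = {
    "node_0": [("east", 120), ("west", 80)],
    "node_1": [("east", 50), ("north", 30)],
    "node_2": [("west", 70), ("north", 10)],
}


def local_sum(rows):
    """Task run on one node: partial SUM(amount) GROUP BY region over local rows."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def merge(partials):
    """Coordinator step: combine the per-node partial results."""
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total


def parallel_group_sum(fragments):
    # The executor plays the role of the scheduler: one task per fragment,
    # each handed to a worker that stands in for a processing node.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(local_sum, fragments.values()))
    return merge(partials)


print(parallel_group_sum(FRAGMENTS))
```

With the sample data the merged totals are 170 for east, 150 for west, and 40 for north. Only the small partial results cross the network to the coordinator, which is why this pattern suits shared-nothing designs.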
Advantages of Parallel Database Systems
• Performance: Parallelism allows for the simultaneous
execution of queries and data processing, significantly
reducing query response times, especially for large
datasets.
• Scalability: Parallel systems can handle growing data
volumes by simply adding more processing nodes, which
increases both data storage and computational power.
• Fault Tolerance: By replicating data across multiple
nodes, the system can tolerate the failure of individual
nodes without losing data or interrupting service.
• Improved Throughput: Multiple queries can be executed
concurrently, leading to higher system throughput,
especially in multi-user environments.