Lecture 6 - Document Databases, Data Formats
A database administrator might choose embedding over referencing in MongoDB to achieve faster reads: all related data is stored in a single document, which suits one-to-one and one-to-many relationships where the child documents are always accessed together with their parent. The main advantage of embedding is that related data can be read and modified in a single database operation, reducing query complexity. The trade-offs are larger documents, which increase RAM usage, and possible fragmentation on disk when an updated document outgrows its allocated space, both of which can degrade read/write performance.
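The contrast can be sketched with a hypothetical blog example (the "posts" and "comments" names are illustrative, not from the lecture): the same one-to-many relationship modeled both ways as plain Python dicts.

```python
# Embedding: comments live inside the post document, so a single read
# returns everything, at the cost of a larger document.
post_embedded = {
    "_id": 1,
    "title": "Intro to MongoDB",
    "comments": [
        {"author": "ana", "text": "Great post!"},
        {"author": "bo", "text": "Thanks."},
    ],
}

# Referencing: comments are separate documents that point back to the
# post via a foreign-key-like field; reads require a second query.
post_referenced = {"_id": 1, "title": "Intro to MongoDB"}
comments = [
    {"_id": 10, "post_id": 1, "author": "ana", "text": "Great post!"},
    {"_id": 11, "post_id": 1, "author": "bo", "text": "Thanks."},
]

# With embedding, one lookup suffices:
assert len(post_embedded["comments"]) == 2
# With referencing, the "join" happens in application code:
joined = [c for c in comments if c["post_id"] == post_referenced["_id"]]
assert len(joined) == 2
```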
In a partitioned network environment, MongoDB's default replication model faces several challenges. If the partition isolates the primary node, the remaining secondaries must hold an election to choose a new primary, during which write operations are delayed. A split-brain scenario can arise if a partition leaves multiple groups of nodes each believing it holds the majority, potentially leading to data inconsistency. With small or even numbers of nodes, a partition may leave no side with a majority at all, in which case no primary can be elected and the replica set becomes read-only, causing service disruption. Careful deployment planning, including arbiter nodes to maintain an odd number of voting members, helps ensure elections succeed.
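The majority arithmetic behind these failure modes can be checked directly; this small sketch shows why even-sized replica sets gain no extra fault tolerance and why a symmetric partition leaves the set without a primary:

```python
def majority(n_voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voters."""
    return n_voting_members // 2 + 1

# A 3-node set needs 2 votes, so it survives one node failure:
assert majority(3) == 2
# A 4-node set needs 3 votes, so it also survives only one failure;
# the fourth node adds no fault tolerance (hence arbiters to reach odd counts):
assert majority(4) == 3
# A 2-2 partition of a 4-node set leaves neither side with a majority,
# so no primary can be elected and the set accepts no writes:
assert 2 < majority(4)
```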
Object-Relational Mapping (ORM) in an RDBMS involves a relatively demanding translation between two structurally different models, the application's object graph and the database's tables, which is the source of the well-known impedance mismatch. In contrast, Object-Document Mapping (ODM) in document databases like MongoDB is lighter weight, because JSON documents closely resemble the in-memory objects they originated from in JavaScript. This reduces the complexity of serialization and deserialization, allowing application objects to be stored and retrieved as documents more or less directly.
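A minimal sketch of why the ODM mapping is thin: an in-memory object converts to a document-shaped dict and back without any table/row translation layer. The `User` class here is illustrative, not part of any ODM library.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class User:
    name: str
    tags: list  # nested/array data maps directly, no join table needed

u = User(name="ana", tags=["admin", "beta"])
doc = asdict(u)          # object -> document-shaped dict
restored = User(**doc)   # document -> object, no mapper in between
assert restored == u
assert json.loads(json.dumps(doc)) == doc  # round-trips as JSON
```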
MongoDB manages high availability and consistency through its replica set model, where each replica set comprises a primary node handling all write operations and secondary nodes that replicate its data. If the primary becomes unavailable, an election is held to promote a secondary to the primary role, ensuring continued availability. Because replication to secondaries is asynchronous, secondaries are eventually consistent with the primary, and MongoDB supports automatic failover; read preference modes let an application balance read requests across primary and secondary nodes.
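A toy failover sketch (not MongoDB's actual election protocol, which also involves terms, heartbeats, and majority voting) illustrates one key idea: the most up-to-date secondary is the preferred candidate for promotion.

```python
def elect_new_primary(secondaries):
    """secondaries: list of (name, oplog_position) pairs.

    Promote the secondary with the highest oplog position, i.e. the
    one that has replicated the most writes (a simplification of the
    real election criteria).
    """
    return max(secondaries, key=lambda s: s[1])[0]

# nodeB has replicated further than nodeC, so it wins the toy election:
nodes = [("nodeB", 105), ("nodeC", 98)]
assert elect_new_primary(nodes) == "nodeB"
```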
BSON (Binary JSON) is used in MongoDB as the storage and data transmission format because it offers several benefits over JSON: as a binary-encoded serialization it supports additional data types that JSON lacks natively, such as dates and raw binary data. BSON also parses faster and is more space- and speed-efficient for database operations. Its design supports fast scans and indexing, which is critical for MongoDB's high-throughput operations.
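The type gap can be demonstrated with the standard library alone: plain JSON serialization rejects exactly the types BSON was designed to carry (in real code, the `bson` package shipped with PyMongo handles them).

```python
import json
from datetime import datetime, timezone

# A document with a date and raw binary data, both BSON-native types:
doc = {"created": datetime(2024, 1, 1, tzinfo=timezone.utc), "blob": b"\x00\x01"}

failed = []
for key, value in doc.items():
    try:
        json.dumps(value)
    except TypeError:
        failed.append(key)  # stdlib JSON has no encoding for these types

# Both fields fail: JSON would need them converted to strings first.
assert failed == ["created", "blob"]
```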
Different index types in MongoDB, such as single-field, compound, multikey, hashed, geospatial, and text indexes, each have distinct performance implications. Single-field indexes are straightforward and speed up lookups on individual fields. Compound indexes optimize queries that filter or sort on multiple fields, multikey indexes handle array fields by creating a separate index entry for each array element, and hashed indexes are efficient for equality searches. Geospatial and text indexes are specialized: geospatial indexes support spatial queries, while text indexes enable searching string content across a collection. Proper use of indexes can greatly improve query performance by reducing the amount of data scanned, but indexes also consume additional resources and can slow down writes, since every data change must update the affected indexes.
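A toy illustration in plain Python (not MongoDB internals) of the difference between a single-field index and a multikey index: the multikey index creates one entry per array element, so one document can appear under several keys.

```python
docs = [
    {"_id": 1, "city": "Oslo", "tags": ["a", "b"]},
    {"_id": 2, "city": "Bergen", "tags": ["b"]},
]

# Single-field index on "city": one index entry per document.
city_index = {}
for d in docs:
    city_index.setdefault(d["city"], []).append(d["_id"])

# Multikey index on "tags": one index entry per array element.
tag_index = {}
for d in docs:
    for t in d["tags"]:
        tag_index.setdefault(t, []).append(d["_id"])

assert city_index["Oslo"] == [1]
assert tag_index["b"] == [1, 2]  # both documents found under "b" without a scan
```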
Sharding in MongoDB distributes data across multiple machines to support databases that require horizontal scalability for large data volumes or high-throughput operations; it becomes necessary when a single machine's capacity cannot handle the data volume or traffic. The components involved are the shards, which store the data and are typically deployed as replica sets; the query routers (mongos), which direct client requests to the appropriate shards; and the config servers, which maintain the metadata describing how data is distributed.
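The routing idea behind a hashed shard key can be sketched as follows; mongos uses MongoDB's own hash function rather than MD5, so this is only an illustration of the principle.

```python
import hashlib

def route(shard_key_value: str, num_shards: int) -> int:
    """Deterministically map a shard key value to a shard index."""
    digest = hashlib.md5(shard_key_value.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, and hashing spreads
# even monotonically increasing keys across all shards:
assert route("user42", 3) == route("user42", 3)
assert all(0 <= route(f"user{i}", 3) < 3 for i in range(100))
```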
Read preference modes in MongoDB determine which members of a replica set serve read operations, affecting both load distribution and how stale the returned data may be. For instance, 'primary' mode ensures reads go to the primary for the latest data, while 'secondary' mode spreads load by allowing reads from secondary nodes. Modes like 'primaryPreferred' and 'nearest' further optimize performance and availability by preferring certain nodes under specific failure conditions or proximity constraints. These preferences matter for read throughput, data consistency, and application availability, particularly in geographically distributed deployments or high-throughput environments.
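A simplified router sketch (illustrative, not the driver's actual server-selection algorithm) captures the behavior of three of these modes:

```python
def select_node(mode: str, primary_up: bool, secondaries: list):
    """Pick a replica set member for a read, given the preference mode."""
    if mode == "primary":
        if not primary_up:
            raise RuntimeError("no primary available")  # reads fail rather than go stale
        return "primary"
    if mode == "secondary":
        return secondaries[0]  # real drivers also weigh latency
    if mode == "primaryPreferred":
        return "primary" if primary_up else secondaries[0]
    raise ValueError(f"unsupported mode: {mode}")

assert select_node("primary", True, ["sec1"]) == "primary"
# During a primary outage, 'primaryPreferred' keeps reads available:
assert select_node("primaryPreferred", False, ["sec1"]) == "sec1"
```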
MongoDB's journaling feature contributes to data durability by recording write operations in an on-disk journal, a write-ahead log, before they are applied to the data files. In the event of a system failure, MongoDB replays the journaled operations to restore the database to a consistent state, minimizing data loss. Journaling has limitations, however: it may not be enabled in every configuration, and maintaining the write-ahead log adds I/O overhead that can reduce write performance. Its durability guarantee also depends on how often journal entries are committed to disk; writes acknowledged before the next journal commit can still be lost in a crash.
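The write-ahead principle can be sketched in a few lines; MongoDB's journal is an on-disk binary log, while this toy version keeps everything in memory purely to show the log-then-apply-then-replay sequence.

```python
journal = []  # the write-ahead log: record intent first
data = {}     # stands in for the data files

def write(key, value):
    journal.append((key, value))  # 1. log the operation
    data[key] = value             # 2. then apply it to the data files

write("a", 1)
write("b", 2)

# Simulate a crash that loses the data files but not the journal:
data = {}
for key, value in journal:  # recovery: replay the logged operations in order
    data[key] = value

assert data == {"a": 1, "b": 2}  # state restored from the journal alone
```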
MongoDB’s transaction mechanisms, which guarantee atomicity only at the single-document level, may be inadequate for applications requiring multi-document transactions with strict ACID compliance, such as financial operations or complex business processes. MongoDB's lack of native support for multi-document ACID transactions prior to version 4.0 is a limiting factor when multiple document operations must be isolated within a single transaction. Possible solutions include application-level consistency mechanisms, such as the two-phase commit pattern, on older versions, or upgrading to MongoDB 4.0 or later, which supports ACID transactions across multiple documents.
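The application-level two-phase commit pattern can be sketched for a hypothetical money transfer; the account and transaction structures below are illustrative, and a real implementation would also need crash recovery that rolls back or re-applies any transaction left in the "pending" state.

```python
accounts = {"alice": {"balance": 100}, "bob": {"balance": 50}}
transactions = []  # stands in for a transactions collection

def transfer(src: str, dst: str, amount: int):
    # Phase 1: durably record the intent before touching the accounts.
    txn = {"src": src, "dst": dst, "amount": amount, "state": "pending"}
    transactions.append(txn)
    # Apply each side as a single-document (atomic) update.
    accounts[src]["balance"] -= amount
    accounts[dst]["balance"] += amount
    # Phase 2: mark the transaction complete.
    txn["state"] = "committed"

transfer("alice", "bob", 30)
assert accounts["alice"]["balance"] == 70
assert accounts["bob"]["balance"] == 80
assert transactions[-1]["state"] == "committed"
```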