High Performance with MongoDB: Best practices for performance tuning, scaling, and architecture

By Ger Hartnett, Alex Bevilacqua, MongoDB Inc., and Asya Kamsky

1st Edition, Sep 2025, 410 pages
High Performance with MongoDB

Systems and MongoDB Architecture

Every database deployment operates within a larger system that includes networks, applications, hardware, and users. To scale or optimize MongoDB effectively, we must approach it with the mindset of system designers, rather than thinking only as database engineers. This chapter introduces that shift in thinking. It will help you see how small changes in one area can lead to significant and sometimes unexpected effects across an entire system. Solving performance problems is not just about tuning technical settings; it also requires a deep understanding of how systems behave.

We will begin by looking at systems more broadly, starting with examples from nature and the real world. You will be introduced to key concepts such as delays, feedback loops, and bottlenecks. These ideas will help explain why systems sometimes behave in counterintuitive or surprising ways when we attempt to change them. As the chapter progresses, we will gradually narrow our focus. We will move from general systems to software systems, then to data services, and finally, to MongoDB. Along the way, we will provide a high-level overview of common bottlenecks and examine two specific examples in more detail. We will also introduce a step-by-step process that you can use to diagnose and resolve performance and scalability issues.

Every system contains a limiting factor, or bottleneck. When a system has multiple inputs and components, the true limiting factor is not always obvious. Improving other parts of the system will not increase overall performance unless the bottleneck itself is addressed. In my experience, many people spend time optimizing the wrong part of the system because they make incorrect assumptions about where the bottleneck lies. A structured approach, based on measurement and experimentation, makes it easier to identify and address the real issue.

This chapter will cover the following topics:

  • Examining the fundamental characteristics of systems
  • Exploring a typical software system and identifying potential performance bottlenecks
  • Introducing MongoDB architecture and its foundations
  • Understanding the connection between core MongoDB components and their role in application performance
  • Discussing strategies for managing complexity in modern data platforms
  • Reviewing performance monitoring tools and observability techniques
  • Finding bottlenecks and the iterative process for tuning

What are systems?

A system is an interconnected set of elements organized in a way that fulfills a purpose. Donella Meadows defines systems as follows:

A set of things—people, cells, molecules, or whatever—interconnected in such a way that they produce their own pattern of behavior over time. The system may be buffeted, constricted, triggered, or driven by outside forces. But the system’s response to these forces is characteristic of itself, and that response is seldom simple in the real world. [1]

We see systems at many scales. Our solar system is a system of planets. Earth has weather and climate systems, as well as political and social systems such as countries and cities. A city contains institutional systems such as universities, which, in turn, have academic systems such as faculties, professors, and students. The engineering faculty has a lab system with multiple computers connected through a network system of bridges and routers. Each computer is a system of hardware and software components, including processors, memory, and applications.

A store in the city is a system that manages products such as computers, with processes for inventory replenishment from suppliers and factories. Beyond the city, a forest is an ecological system containing trees, each a system of roots, branches, and leaves. A wild boar in the forest has biological systems, including skeletal, nervous, circulatory, and digestive systems. Even a single cell within the boar’s ear is a system, with its own internal subsystems such as mitochondria, a nucleus, and ribosomes.

The common thread unifying all these different systems is that their elements interact with each other within the system and with other systems around them.

A system’s performance is often not determined by the strongest or fastest element, but by the weakest link. This is sometimes referred to as the bottleneck or the limiting factor. This is the element within a system that significantly restricts the system’s overall performance or growth. Bottlenecks can change over time. If you fix a bottleneck, another element will then become the bottleneck. Remember, everything that is true for general systems is true for software systems, as well. In later sections, we’ll look at typical software bottlenecks and some of the tactics you can use to fix them.

Characteristics of systems

In this section, we’ll look at the potential pitfalls we can encounter when changing systems. We’ll see how buffers and delays can make it hard to understand a system’s behavior, and how changes can lead to unexpected outcomes.

Changing systems is a risky business

The elements of a system can interact in non-linear ways. For example, a school is a system with elements such as students, classrooms, playgrounds, and basketball hoops. The behavior of the system doesn’t always emerge in a predictable way, as discussed here:

  • Small changes can have big effects (butterfly effect): An insignificant action can lead to massive, unpredictable consequences. For example, a student starts running in the playground. Another student starts running after the first one. Soon, more and more students see the fun and join in. Suddenly, half the playground is playing tag.
  • Feedback loops (chain reaction): This is a direct, step-by-step process where one event leads to another. In the cafeteria, a student tells a joke. If it’s funny, a few others laugh. Their laughter makes others in the cafeteria curious, and soon, everyone is laughing, even those who didn’t hear the joke. The laughter spreads in a way that builds on itself. This is called a positive feedback loop.
  • More than the sum of its parts (synergy): A system can produce results that are greater than just adding up its components. During band practice, a single drum beat sounds simple. A single trombone note is cool. But when a whole band plays together, the music sounds much better than just adding up individual sounds.
  • Unexpected outcomes (emergence): Systems can create results that aren’t obvious from the individual parts. For instance, during art class, a student mixes red and yellow paint to get orange, which surprises them.

Non-linear interactions mean small actions can trigger big changes, feedback loops can amplify their effects, elements can work together unexpectedly, and new outcomes can emerge. In general, humans struggle to comprehend non-linear interactions between different elements of the system and tend to see behavior as a series of discrete events, making it easy to miss the underlying patterns.

We can build models of systems and see how they behave. Of course, the real world is always going to be more complex than any model we can build. A model of a system has inputs, stocks, variables, and feedback loops. An example of a store/inventory system is shown in Figure 1.1.

We’ll use this example to explain the concepts of stocks, variables, and feedback loops. Reading from the top left in an anticlockwise direction, the system has an inflow of deliveries and an outflow of sales. The products held in the store form a stock, called the inventory.

Figure 1.1: An example system model showing store inventory management

The delivery delay and customer demand are inputs regulating the flow of products. B signifies a balancing feedback loop. Customer demand causes the products to flow out of the system as sales, reducing inventory. The manager doesn’t want to keep checking the stock every day, so there’s a perception delay variable. After the delay, the manager compares perceived sales with the desired inventory. The manager doesn’t want to overreact to a change in customer demand, so they have a response delay variable limiting the size of an order to the supplier, who then delivers more product after a delay.

This model only uses balancing feedback loops. The other kind of feedback loop is a reinforcing loop, which typically increases the size of stocks. For example, in a bank savings account, the more money you have in the account, the more you earn in interest, which, in turn, increases the amount of money in the account.

A system with no delays is simple

Let’s say the manager wants to keep 10 days of demand in inventory, and the demand is 200 items per day. This means the desired inventory is 2,000. On day 10, demand increases by 10%. Customers are now buying 220 items per day. The manager increases the size of the order, and more inventory is delivered on the same day. The desired inventory goes up to 2,200.

Figure 1.2: Behavior over time in a simple system with no delays

When there are no delays in the system, adjustments can be made instantly to changes in demand, making the system’s behavior straightforward and predictable.

A system with delays can behave in unexpected ways

In complex systems, there are often delays between taking an action and seeing its effect, as described by Donella Meadows:

Because of feedback delays within complex systems, by the time a problem becomes apparent it may be unnecessarily difficult to solve. [1]

For example, if inventory orders take several days to arrive, a sudden increase in demand might not be addressed quickly enough. This lag can cause overcorrections or shortages, creating instability. Understanding and managing these delays is crucial to maintaining smooth system performance.

Now, let’s add some delays:

  • Perception delay: Before changing the order, the manager averages sales over a 5-day period
  • Response delay: The manager makes up 40% of the inventory shortfall when changing the order
  • Delivery delay: The factory takes 5 days to deliver the order

Figure 1.3: Inventory levels oscillate after demand rises due to delayed order adjustments

As you can see in Figure 1.3, after day 10, when demand first increases by 10%, the stock levels start to drop. Then, after 5 days, the manager adjusts the order to make up 40% of the shortfall. The increased deliveries start to arrive, and on day 20, the stock equals the desired inventory. However, the increased deliveries continue, and by day 25, the store is overstocked. The manager reduces the order. The cycle continues, and the inventory oscillates between under- and over-stocked.

Trying to fix oscillations

Here are three options you could try to fix this unwanted behavior:

  • Faster perception (reduce delay)
  • Faster response (increase order)
  • Slower response (decrease order)

Close your eyes for a minute and decide which of these options you’d try first. We’ll examine the outcome of each option.

First, we’ll reduce the perception delay from five days to two, as shown in Figure 1.4.

Figure 1.4: Reducing perception delay from five days to two days for faster response

So, the manager reacts sooner, and the desired inventory goes up to 2,200 faster, but if anything, the oscillations start a little sooner, and they are slightly higher. The peak inventory just before the 60-day mark is now 2,994 items. It was 2,981 before changing the perception delay.

Next, we’ll try responding faster by changing the order to make up 60% of the shortfall instead of the original 40%. As you can see in Figure 1.5, the behavior gets even worse. The oscillations get much bigger: the peak inventory climbs to 3,800, and it falls as low as 1,400 on some days.

Figure 1.5: Reducing response delay – inventory swings grow larger

Finally, we’ll try a slower response, making up only 25% of the shortfall instead of the 60% we tried in the previous example.

Figure 1.6: Slower order changes reduce inventory oscillations

As shown in Figure 1.6, the behavior of the system is much better now. The inventory still oscillates, but it starts to settle close to the desired inventory around 20 days after the change in demand.
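The oscillation, and the effect of the response fraction on it, can be reproduced with a toy simulation. This is a minimal sketch, not the exact model behind Figures 1.2 to 1.6 (the parameter names and the ordering rule are assumptions), but it shows the same qualitative behavior:

```python
def simulate(days=80, base_demand=200, new_demand=220, change_day=10,
             perception_days=5, response=0.4, delivery_days=5, cover_days=10):
    """Toy store-inventory model with perception, response, and delivery delays."""
    inventory = base_demand * cover_days          # start at the desired inventory
    in_transit = [base_demand] * delivery_days    # orders already on the way
    sales = []
    history = []
    for day in range(days):
        demand = new_demand if day >= change_day else base_demand
        inventory += in_transit.pop(0)            # today's delivery arrives
        sold = min(demand, inventory)
        inventory -= sold
        sales.append(sold)
        # Perception delay: the manager reacts to average recent sales
        perceived = sum(sales[-perception_days:]) / min(len(sales), perception_days)
        desired = perceived * cover_days
        # Response delay: only make up a fraction of the shortfall each day
        order = max(0.0, perceived + response * (desired - inventory))
        in_transit.append(order)                  # delivery delay before it arrives
        history.append(inventory)
    return history

fast = simulate(response=0.6)   # aggressive corrections: bigger swings
slow = simulate(response=0.25)  # gentler corrections: settles sooner
```

Varying `response` reproduces the counterintuitive result from this section: making corrections more gently reduces the overshoot, while reacting harder makes the swings larger.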

Systems surprise us

Systems can react in counterintuitive ways. In this relatively simple system, the store manager was reacting too quickly. Our natural inclination was toward faster perceptions and responses, but they made the system behave even worse. It was only when we tried something that seemed counterintuitive (slower responses) that the behavior of the system improved. This is also true of software systems, and we’ll see some practical examples later in this chapter. Changing some variables can have a big impact on the overall behavior, but others might have little to no effect. In these examples, we saw that a perception delay had little effect, but a response delay did.

Real systems are always more complex than models and can have hidden variables or feedback loops. This system was relatively simple; a more complex system can have multiple stocks and more feedback loops. Systems are inherently oscillatory, and their behavior is revealed as a series of events over time. Our human minds are not good at understanding non-linear growth or the interactions between elements, so we make assumptions about where the current bottleneck is and try to improve parts of the system that will not increase performance unless the limiting factor is addressed.

A typical software system

Now, let’s apply what we’ve discussed about systems in general to a software system. We’ll begin by looking at a typical software system at a high level, as shown in Figure 1.7, and describe some of the potential bottlenecks.

Figure 1.7: High-level view of a software system showing a client-to-database request flow

Looking from left to right, we see two clients. These could be a desktop computer and a mobile device. These clients connect via the internet to an application programming interface (API) gateway, which sends requests and receives notifications. The API gateway connects to multiple application servers. These modify, route, and forward client requests to one or more software components running on those application servers. Each application server can have multiple software components using one or more frameworks (FW) such as Spring or Rails. These typically contain data service abstractions or object modeling components that use a database-specific driver (sometimes called a client library) that talks to the data services.

Any of these components (the computers they run on or the networks connecting them) could be a bottleneck to the performance or scalability of the overall application.

Here are two examples of performance bottlenecks in a software system:

  • An API gateway bottleneck: As the system gains popularity and more users connect to the API gateway, the gateway reaches a point where it is handling so many connections and requests that the CPU is fully utilized. This could be identified by looking at the idle CPU available on the computer running the API gateway, and solved by vertically scaling the computer or the cloud instance running the gateway.
  • A framework bottleneck: The application software may send seemingly simple requests to the framework, but under the hood these requests can trigger expensive database operations, such as a full collection or table scan on the database. This inefficient querying uses CPU and storage bandwidth on the database while also sending more data than necessary back to the application servers. As a result, unnecessary network bandwidth is used, and additional CPU resources are required on the application server to process the data.

This latter bottleneck could be identified in a number of ways:

  • Profiling the application could find that the requests to the framework are taking a surprising amount of time.
  • During development, an end-to-end trace can be used to track a user request from a client through the application code, framework, and data services abstraction to see what kind of query is run on the data services and the subsequent input/output (I/O) operations.
  • Such a trace could identify read or write amplifications, where a small client request results in a disproportionately large database operation. For example, a client query for one data point of 32 bytes could make its way through the full system and result in a table scan on the database that pulls 32 GB of data through I/O and caches for no reason. This could be remedied by configuring the framework to be more efficient.
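To put the read amplification in that example into numbers:

```python
returned_bytes = 32                  # one 32-byte data point for the client
scanned_bytes = 32 * 1024**3         # a 32 GB table scan on the database
amplification = scanned_bytes // returned_bytes
# 1024**3 = 1,073,741,824 — a roughly billion-fold read amplification
```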

We will provide specific examples of these kinds of bottlenecks and solutions throughout the book, culminating in a number of specific case studies in Chapter 15, Debugging Performance Issues.

Identifying potential bottlenecks is only the first step in optimizing a software system. To remove these bottlenecks, we need a structured approach applying performance engineering principles.

Performance engineering principles span a range from algorithmic theory to Little’s law. By applying these principles, developers can minimize bottlenecks, improve scalability, and maximize resource efficiency. In the next subsections, we’ll explore some of these principles and see how they guide the design of high-performing systems, which you can keep in mind as you read this book.

Algorithmic efficiency (complexity)

The choice of algorithm and data structure is typically where most computer science courses first introduce performance. An algorithm that scales linearly or logarithmically with input size will perform better than one that scales exponentially. No amount of micro-optimization will compensate for an inherently slow algorithm. In modern systems, this might mean using built-in libraries (often optimized in C or assembly) or leveraging parallel algorithms.
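A minimal illustration of the point: the two functions below return the same answer, but on a sorted million-element list the linear scan may need up to a million comparisons while the binary search needs about twenty.

```python
from bisect import bisect_left

def linear_search(sorted_items, target):
    """O(n): examine items one by one until we hit the target."""
    for i, item in enumerate(sorted_items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log n): halve the remaining search space on every step."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
position = binary_search(data, 999_999)  # same result as the linear scan
```

No amount of tuning inside `linear_search` will close that gap; only changing the algorithm does.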

Avoid premature optimization

As the saying goes, “Make it work, then make it fast.” In other words, focus on building a correct and reliable system before worrying about performance.

This idea is echoed by computer scientist Donald Knuth, who famously warned that “premature optimization is the root of all evil.” Developers often guess wrong about what’s slowing their systems down; profiling tools regularly uncover performance bottlenecks in unexpected places. Modern processors can also execute compiled code in ways that defy human intuition.

By resisting the urge to optimize too early, you keep your code clean and maintainable. You also ensure that your performance efforts are grounded in real data, not assumptions.

Amdahl’s law (limit of parallel speedup)

Amdahl’s law explains why certain workloads don’t speed up proportionally with more cores. If a fraction S of a program must run sequentially, for example, to protect a shared data structure, then even with infinite processors, the speedup is limited to 1/S. For example, if 10% of a task is serial, the best overall speedup you can get approaches 10x (assuming the other 90% runs on many processors). When using multicore processors and distributed computing, one must identify and reduce the serial portions of the workload.
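The formula is simple enough to sketch directly: with serial fraction S and N processors, the speedup is 1 / (S + (1 − S)/N).

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup limit from Amdahl's law: 1 / (S + (1 - S) / N)."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / processors)

# 10% serial work: 16 cores give well under 16x
sixteen_cores = amdahl_speedup(0.10, 16)   # 6.4
# ... and even a million cores barely reach the 10x ceiling
many_cores = amdahl_speedup(0.10, 1_000_000)
```

This is why shaving even a few points off the serial fraction often pays better than adding hardware.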

Locality and caching

Programs typically access a relatively small portion of data or code repeatedly in a short period of time. Caches exploit this locality to speed up access to frequently used data. Systems use caches at many levels: CPU caches for recently accessed memory, in-memory caches for database results, and application-level caches that store the results of expensive operations and reuse them. However, caching introduces complexity, such as stale data and memory overhead. This is why it’s mentioned in the following joke:

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

Leon Bambrick

Ironically, there’s now a counterpoint to that joke:

There’s 2 hard problems in computer science: we only have one joke and it’s not funny.

Phillip Scott Bowden

The use of caching (e.g., in I/O operations) is crucial to get good performance out of slow devices, but it’s also applicable at other layers of distributed systems.
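As a small application-layer illustration, Python’s `functools.lru_cache` memoizes a function so repeated calls with the same arguments never re-run the slow code path (the function here is a stand-in, not a real database call):

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=128)
def expensive_lookup(key):
    """Stand-in for a slow operation such as a database query."""
    global call_count
    call_count += 1
    return key.upper()

expensive_lookup("mongodb")   # computed; call_count becomes 1
expensive_lookup("mongodb")   # served from the cache; call_count stays 1
```

The trade-off the joke points at is real: `lru_cache` has no idea when the underlying data changes, so cached entries can go stale — that is the cache invalidation problem.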

Little’s law (throughput versus latency)

Performance is multi-dimensional. Sometimes the goal is to handle a high volume of operations (throughput), while other times, the priority is to minimize the delay between the start and completion of a single operation (latency). In many cases, both are equally important.

You can increase concurrency to improve throughput (e.g., using multithreading, asynchronous I/O, or non-blocking event loops). However, these techniques might introduce queuing delays that affect latency. Having too many concurrent operations can push a system into an inefficient state. It’s important to first identify your performance target and then optimize accordingly. For high throughput, maximize resource utilization (CPU, network, and disk) with concurrency. For low latency, minimize waiting and depth of processing. Balancing these is an art, guided by analysis and benchmarks.
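Little’s law itself ties the two dimensions together: the average number of requests in flight, L, equals the arrival rate λ multiplied by the average time each request spends in the system, W. A quick sketch:

```python
def requests_in_flight(throughput_per_sec, avg_latency_sec):
    """Little's law: L = lambda * W."""
    return throughput_per_sec * avg_latency_sec

# A service handling 2,000 requests/s at 50 ms average latency
# holds about 100 requests in flight at any moment.
in_flight = requests_in_flight(2000, 0.050)  # 100.0
```

The law cuts both ways: if latency doubles at the same throughput, the system must hold twice as many requests in flight, which is often where connection pools and queues start to saturate.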

In summary, performance engineering should be data-driven, using profilers and monitors to identify bottlenecks. It should be informed by both classic insights (such as algorithmic complexity and Amdahl’s law) and modern techniques (such as parallel computing, vectorization, and distributed load balancing). Focus optimization efforts where they yield the most benefit, and always verify the impact with measurements.

Understanding MongoDB architecture

Now that you understand systems in general, it’s time we start looking at MongoDB systems in particular. Before we do that, let’s take a look at MongoDB architecture at a high level. We’ll dive deeper into each of these components throughout the book.

First, it’s important to understand the foundations of MongoDB. MongoDB was created to bridge the gap between traditional relational databases and simple key-value stores. The central idea was to bring together the best aspects of both worlds (flexibility, speed, and an easy-to-understand data model), while offering features such as powerful querying, indexing, and complex aggregations. In addition, from the early days, MongoDB had native libraries for many languages, built-in replication, horizontal scaling via sharding, and an intuitive document data model.

Key-value stores are often praised for their speed and simplicity. MongoDB, however, goes beyond that simple approach by letting you store and query entire documents (objects) with rich relationships and structure. Relational databases, on the other hand, organize data into tables with predefined columns and relationships and enforce strict schemas.

In the next section, we’ll take a closer look at the MongoDB document model and how it differs from a relational database.

The document model: MongoDB’s foundation

In a relational database, data is typically normalized into separate tables to support a wide range of use cases. This can require complex joins for application-specific views. In contrast, MongoDB encourages you to model data as application-specific documents, optimized for how your application accesses and uses it. This flexibility allows for representing complex relationships and hierarchical data within a single document.

JavaScript Object Notation (JSON) is a lightweight, human-readable data format used to store and exchange data between applications. MongoDB’s document model stores data in JSON-like structures called Binary JSON (BSON) documents.

Here’s an example of a document:

# A simple document representing a user
user = {
    "name": "Jane Doe",
    "age": 30,
    "email": "[email protected]",
    "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA",
        "zip": "12345"
    },
    "interests": ["hiking", "photography", "travel"]
}

This contains a number of properties about a user. In a relational database, the user’s interests would be stored in a separate table, which would have to be queried separately or joined with every time you needed the user’s interests alongside the other user information. From a performance perspective, in MongoDB, the array of interests is stored with the user information, so all user information can be read from storage in one access and received by an application in one query. In addition, interests can be indexed so that it’s simple and fast to query for all users interested in hiking, for example.

Have you ever tried to model hierarchical data in a relational database? If you have, you know it often feels like trying to fit a square peg into a round hole. In MongoDB, nested structures are a natural fit, with no complex joins or entity-relationship diagrams required.

Key architectural components of MongoDB

The MongoDB server runs as a single mongod process that can run almost anywhere. You can deploy the mongod process on many classes of computers, including laptops, self-managed servers in your own data centers, cloud instances managed by your business, and cloud instances managed by MongoDB Atlas.

Many of the concepts we’ll discuss apply to this server process; however, it’s also a building block of the MongoDB distributed system. A basic production deployment is usually, at a minimum, a replica set of three computers, each running a mongod process. One of them is the primary, and the other two are secondaries. All members of a replica set hold the same copy of the data. You will learn more about this in Chapter 5, Replication, but the most important thing to know is that replication provides high availability for your application by monitoring the health of all members and automatically electing a new primary should the old primary fail.
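From the driver’s point of view, connecting to a replica set is just a connection string that lists the members and names the set; the driver then discovers the topology and follows primary elections on its own. The hostnames and set name below are hypothetical:

```python
# Hypothetical replica set members; any one of them is enough for the
# driver to discover the rest of the topology.
members = [
    "db1.example.net:27017",
    "db2.example.net:27017",
    "db3.example.net:27017",
]
uri = "mongodb://" + ",".join(members) + "/?replicaSet=rs0"
# In a real application this URI would be passed to MongoClient(uri);
# if the primary fails, the driver reconnects to the newly elected primary.
```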

When you horizontally scale MongoDB, you would treat your replica set as the first shard and then add as many additional shards as you need. Each shard is a replica set. Each shard holds a subset of the data, and a smart router process directs operations to the correct shard, eliminating the need for your application to know how the data is distributed. You’ll learn more about this in Chapter 6, Sharding.

Atlas is an integrated suite of cloud database and data services to accelerate and simplify how you build with data. Atlas extends MongoDB’s flexibility and ease of use to enable you to build full-text search, real-time analytics, and event-driven features.

MongoDB client libraries (or drivers) allow your application code to communicate with the database using a native library and APIs that look like any other API in the language you’re using. You construct a request in your preferred programming language, and the driver converts it to messages MongoDB understands.

MongoDB provides official drivers for most of the popular programming languages, including Python, Java, C#, and Node.js, among others. You can find the full list in the MongoDB documentation at https://www.mongodb.com/docs/drivers/.

Here’s a simple example of how to connect to MongoDB and insert a document using PyMongo, the official MongoDB driver for Python:

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['store_database']
collection = db['products']

# Define a product document with nested information
product = {
    "name": "Ergonomic Keyboard",
    "price": 129.99,
    "category": "Computer Accessories",
    "details": {
        "weight": 1.2,
        "dimensions": "18 x 6 x 1 inches"
    }
}

# Insert the document
collection.insert_one(product)

What you don’t see in the preceding code is all the heavy lifting the client library is doing: managing connections, communicating the MongoDB wire protocol to the server, handling network timeouts, implementing retry logic, and transforming Python dictionaries to BSON and back.

The following code queries the products collection for affordable products (less than $100):

# Find affordable products
affordable_products = collection.find({'price': {'$lt': 100}})
for product in affordable_products:
    print(f"Found affordable product: {product['name']}")

When MongoDB returns a document, it arrives as a native data structure in your language, which is a Python dictionary in this case. No awkward conversion is needed. This natural mapping between your application code and database representation is a significant advantage for developer productivity.

The data services system

Let’s zoom in now to the data services part of the overall software system that we looked at earlier in Figure 1.7.

Figure 1.8: MongoDB architecture showing the connections between core database components

Looking from left to right in Figure 1.8, we see that the application servers are communicating with the data services through a driver or client library. In this case, we are showing a sharded cluster. For simpler deployments, the application server could connect to a single replica set without the mongos and config server. Conceptual elements such as the schema, queries, and indexes are shared between the application server and the data services layer. Some performance and scalability issues can be caused by these conceptual elements, and changing them might be the source of some improvements.

In this case, we are looking at a diagram of an Atlas deployment that includes elements such as the Atlas UI (which includes monitoring and backup), the Atlas control plane, charts, and features such as Vector and Semantic Search, and Data Federation.

The rest of the components are considered the core database, and they are as follows:

  • mongod: A single server process. Atlas can deploy multiple mongod processes in replica sets (for high availability) and shards (for horizontal scaling).
  • mongos: A router process for a sharded cluster. It routes queries to one or more shards and combines results from multiple shards into responses.
  • config: A special, internal database that stores the metadata for a sharded cluster. The metadata reflects the state and organization of all data within the sharded cluster. mongos uses this metadata to route read and write operations to the correct shards, and maintains a cache of it.

Finally, in Figure 1.8, the mongod processes will save and retrieve the data via the storage subsystem. This could be a disk in a self-managed MongoDB system or cloud storage, such as the following:

  • Elastic Block Storage (EBS) in AWS and Alibaba Cloud
  • Managed Disks in Azure
  • Persistent Disks (PD) in GCP
  • Locally attached ephemeral SSDs

When using Atlas, we present the overall data services system as a black box with a small number of settings you need to consider. Atlas takes care of the details for you. The settings from an Atlas perspective are as follows:

  • The number of shards and replica set members.
  • The instance size, which determines the number of virtual central processing units (vCPUs), the amount of memory, and network performance.
  • The storage configuration, which includes the size and throughput of the device. Throughput is usually defined as input/output operations per second (IOPS).

If you are self-managing MongoDB on your own cloud or physical servers, you have visibility of many of the component settings inside the server and software that the mongod process is running on. This is a double-edged sword. You have more settings you can tune to improve performance. However, you also have more elements to manage and potential bottlenecks to explore.

Now, if we zoom in to the mongod, we find that it is itself a system with multiple elements. The following diagram (Figure 1.9) shows the main elements inside the MongoDB server (mongod):

Figure 1.9: Elements inside the MongoDB server

Query engine

MongoDB’s query engine takes your query, determines the most efficient way to find the data you need, and returns the results. It uses indexes to optimize query execution and passes requests to the storage engine to execute operations.

An index is like the index at the back of a book. Instead of reading every page of a book to find a topic, you can use the index to directly find out which pages contain the topic. Similarly, in a database, an index helps the query engine locate data more efficiently.

MongoDB allows you to do a lot more than a key-value store does. Its query and indexing capabilities rival (and, in some ways, surpass) those of traditional relational databases:

  • You can query on any field, not just the primary key
  • You can perform complex logical operations
  • You can query inside nested documents and arrays
  • You can aggregate and analyze data
  • You can create indexes on any combination of fields, including arrays and subdocuments
  • You can add multiple indexes to a collection of documents
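To make these capabilities concrete, here are a few illustrative filter documents, written as plain Python dictionaries that could be passed to a driver’s find() method (the field names are hypothetical):

```python
from datetime import datetime

# Query on any field, not just the primary key:
by_name = {"name": "Ada"}

# Complex logical operations ($or, $and, $not, and so on):
active_or_senior = {"$or": [{"status": "active"}, {"age": {"$gte": 65}}]}

# Query inside nested documents using dot notation:
by_city = {"address.city": "Dublin"}

# Match documents whose array field contains a value:
by_tag = {"tags": "urgent"}

# Range query on a date field:
recent = {"created_at": {"$gte": datetime(2025, 1, 1)}}
```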

Indexes make queries more efficient. As with many aspects of performance, there are costs to indexes, too. When creating/inserting new documents, every index will necessitate an additional write operation to storage. Similarly, when updating an indexed field of a document, an extra write will be needed to update the index.

You can get the query system to provide explain plans, which provide details about how a query ran. This can provide valuable information for finding potential bottlenecks and generating ideas for improvements.

We’ll go into a lot more detail about indexes and queries in Chapter 3, Indexes, and Chapter 12, Advanced Query and Indexing Concepts.

Storage engine/WiredTiger

Each mongod process needs to store the data on disk. The storage engine is MongoDB’s filing system; it’s the component responsible for managing how data is stored, retrieved, and maintained on disk. Since MongoDB 3.2, WiredTiger has been the default storage engine.

The storage engine takes care of persisting data and indexes on disk, fetching the data from disk, as well as handling the compression and encryption of the data on disk. It also handles the journal, known as the write-ahead log in many databases. The storage engine also contains an in-memory cache of the most recently accessed documents. You can change the size of the WiredTiger cache using the wiredTigerCacheSizeGB setting. The remaining memory will be used by the operating system’s filesystem (page) cache. You’ll learn more about storage engines in Chapter 7, Storage Engines.
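For a self-managed deployment, the cache size (and the default collection compressor) can be set in the mongod configuration file; the values below are illustrative, and on Atlas these settings are managed for you:

```yaml
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8          # default: the larger of 50% of (RAM - 1 GB) or 256 MB
    collectionConfig:
      blockCompressor: snappy  # default compressor for new collections
```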

Libraries

When developing MongoDB, we used many other building blocks and libraries. These also have settings that can be changed to improve performance.

Starting with MongoDB Server 8.0, a newer version of the memory allocator (TCMalloc) supports per-CPU caches, instead of per-thread caches. This can reduce memory fragmentation and make your database more resilient to high-stress workloads. A number of system configuration settings may need to be modified to use per-CPU caches. You’ll find more details on system settings in Chapter 13, Operating System and System Resources. If you really know what you are doing, you can change controls such as the logical page size, but this could necessitate changing the settings in the MongoDB source code and rebuilding the server. Again, Atlas provides defaults and implements best practices.

Compression algorithms are more accessible than these low-level controls. You can specify the storage compression algorithm either when creating a collection or via configuration options when starting the server. By default, Atlas uses an algorithm called Snappy. When running self-managed MongoDB, you have four compressor options:

  • None
  • Snappy: Fastest, lowest CPU consumption; doesn’t compress data as much
  • Zstd: A middle ground between Snappy and Zlib
  • Zlib: Slowest, highest CPU consumption; compresses data better

You can also use network compression between the driver and the MongoDB server. In this case, you can select the compression algorithm via the driver in your application code, whether the server is on Atlas or not.

MongoDB uses OpenSSL to handle TLS/SSL encryption for the following:

  • Client-server connections (e.g., mongod ⇄ MongoDB Shell)
  • Replication and sharded cluster communication
  • MongoDB drivers connecting to the database

Different versions of OpenSSL have different performance characteristics and scalability on higher-tier instances/servers.

Other system components that mongod uses

MongoDB runs within an operating system, which, in turn, is managed by a kernel. In operating systems such as Linux, parameters such as vm.dirty_ratio and vm.dirty_writeback_centisecs tune the kernel’s mechanisms for managing virtual memory. They change how data is handled before being written from memory to disk.

MongoDB can run inside virtual machines (VMs). However, you need to make sure it’s configured correctly from the VM perspective. For example, over-allocating vCPUs can lead to CPU contention among VMs running on the same host, which can degrade performance. You can learn more about concepts such as ballooning and how to configure MongoDB for VMs in the MongoDB server production notes at https://www.mongodb.com/docs/manual/administration/production-notes/.

MongoDB runs on many types of CPUs in networked systems with different amounts of memory. These typically depend on the cloud instance size or computer server on which you deploy MongoDB. In general, larger instances or servers will have more CPUs, memory, and higher network bandwidth.

As we saw earlier, with Atlas, you can select different types and configurations of cloud storage. You typically have the same options when running your own cloud instances or self-managed storage. In addition, you can choose different filesystems and settings, such as readahead. When a program reads a file, Linux anticipates future reads and loads additional data into memory. This reduces the number of disk access operations, making sequential reads faster. This will be discussed in more detail in Chapter 13, Operating System and System Resources.

Managing complexity in modern data platforms

Let’s zoom back out from a single mongod process to look at the data services layer again. To your application, the data services element is one component, but in reality, it can be a complex distributed system in its own right with multiple individual servers.

MongoDB’s approach is to provide a data services API that simplifies this complexity using three key principles: a flexible data model, built-in redundancy and resilience, and horizontal scaling.

Flexible data model with rigorous capabilities

The document model allows you to represent relationships naturally while still supporting rigorous data governance when needed. One MongoDB user explained it perfectly: “We can push updates to our application without having to coordinate complex database migrations. The schema evolves with our code, not against it.” You can learn more about this flexible data model in Chapter 2, Schema Design for Performance.

Built-in redundancy and resilience

Rather than treating high availability as an add-on feature, MongoDB bakes resilience into its core architecture through replication:

  • Automatic failover: If the primary node fails, an election determines a new primary within seconds
  • Self-healing: When failed nodes recover, they automatically catch up and rejoin the replica set

Again, Atlas provides defaults, implements best practices, and simplifies the number of settings you need to consider. When self-managing MongoDB, you can dig deeper into replication settings. For example, if running MongoDB on your own servers, you can change server parameters to tune how replication data is batched. Larger batches reduce network traffic and the number of IOPS required on secondaries, but can add latency for writes. You can also configure flow control, which can prevent replication lag by slowing down writes if secondaries are falling behind. You can learn more about replication in Chapter 5, Replication.

Horizontal scaling with intelligent distribution

MongoDB’s sharding capability distributes data across multiple replica sets while keeping related data together:

  • Automatic balancing: MongoDB continuously rebalances data across shards as your data grows
  • Query routing: The MongoDB router (mongos) intelligently directs operations to the appropriate shards
  • Zone sharding: You can align data distribution with your infrastructure (keeping European customer data in European servers, for example)

MongoDB Atlas simplifies sharding by automating key processes, enabling best practices, and providing built-in features to ensure optimal performance. You can read more about this in Chapter 6, Sharding.

Performance tools

You can use performance monitoring tools to help find potential bottlenecks or limiting factors. These could be in any of the components or elements we mentioned earlier. More details will be provided in Chapter 14, Monitoring and Observability.

The mongod and mongos processes write log files with performance information. They also support the serverStatus command, which returns information about the running process. The query engine can “explain” how a query will run and supports a profiler feature. Command-line tools, such as mongostat and mongotop, are provided to monitor a running mongod process.

To monitor utilization of all kinds of system resources, you can use operating system commands, MongoDB Atlas monitoring, and other GUI tools that are designed specifically for monitoring multiple servers.

For system resources specifically, Linux commands such as top, vmstat, iostat, and netstat can be used to monitor CPU, memory, storage, and network at a point in time on a single server. Tools such as dstat and sar can be used to combine data from simpler tools and view it over a period of time.

MongoDB Atlas monitoring provides historical and real-time system resource data, which is integrated with database-specific metrics. It has tools such as Real-Time Performance Panel (RTPP), Performance Advisor, Namespace Insights, and Query Profiler. You can also set alerts for specific performance events.

For self-managed deployments, you can use graphical tools such as Netdata, Prometheus + Grafana, Zabbix, Munin, or Observium. You can use observability frameworks such as OpenTelemetry. Finally, you can use commercial application performance monitoring (APM) tools such as AppDynamics, Datadog, Dynatrace, LogicMonitor, New Relic, SolarWinds, and Splunk Observability.

Finding bottlenecks

As mentioned at the start of this chapter, every system has a limiting factor or bottleneck. With multiple inputs to a system, it’s not always clear what that limiting factor is, so we need to find the element or component that is the current bottleneck. To do this, we can use the tools described previously to analyze the performance of each of the potential bottlenecks covered so far.

In summary, the following table lists the potential bottlenecks you might encounter when tuning a typical software application. The first column shows the potential bottlenecks you could encounter on both Atlas and self-managed systems. The second column shows the additional bottlenecks you are responsible for with self-managed deployments. As you can see, using Atlas significantly reduces the number of bottlenecks you need to consider.

| Atlas and self-managed (private cloud) | Self-managed (private cloud) |
| --- | --- |
| UI/application code | Network |
| Framework | Replication |
| Driver | Storage engine |
| Network | Libraries (allocator, compression, and encryption) |
| Schema/model | OS/kernel |
| Queries | Virtual machine |
| Indexes | Memory (extra details) |
| Sharding | CPU (extra details) |
| Cloud storage | Storage (extra details) |
| CPU | |
| Memory | |

Table 1.1: Atlas vs. self-managed (private cloud)

For the remainder of this section, we’ll look in more detail at two example bottlenecks.

We’ll look at a storage bottleneck first, including the measurements and indicators that will help us identify it. In this example, we have a workload that’s mostly reading data:

  • 95% of the database operations in the steady state are reads of documents using the _id field (primary key)
  • The documents are being retrieved in random order, meaning the _id values are being accessed in a different order from when they were inserted
  • The dataset is 10x larger than the memory on the database server, so most of the read operations cannot be satisfied by a cache read and they need to access the storage device
  • The storage device is specified to provide 4,000 IOPS

We see the system is running 5,500 database operations per second using mongostat or Atlas monitoring. We try to increase performance by scaling up the instance type. The database performance stays at 5,500 operations per second. After running this experiment, we know the bottleneck is not the CPU or memory size of the systems running the MongoDB server. We scale the instance size down again.

Next, we look at measurements of I/O:

  • If running self-managed instances, you can look at the number of IOPS using the iostat command on Linux
  • If running in Atlas, you can observe the number of IOPS being used on the Max Disk IOPS metric when you look at monitoring

To try and improve performance, we now increase the number of storage IOPS from 4,000 to 8,000. Database performance doubles to 11,000 operations per second. We see this situation quite frequently with customers: they assume they need a more powerful database instance when, in reality, they need more IOPS from the storage subsystem.
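A back-of-envelope model ties these numbers together; the cache-hit ratio below is an assumed input, not a measured value:

```python
def max_read_ops(storage_iops: float, cache_hit_ratio: float) -> float:
    """Estimate the read ceiling when every cache miss costs one disk I/O."""
    miss_ratio = 1.0 - cache_hit_ratio
    if miss_ratio <= 0:
        return float("inf")  # fully cached: storage is not the limit
    return storage_iops / miss_ratio

# With a dataset ~10x memory, most reads miss the cache. Assuming roughly
# a 27% hit ratio, 4,000 IOPS caps throughput near the observed 5,500 ops/s:
print(round(max_read_ops(4_000, 0.27)))   # ~5479
# Doubling the IOPS doubles the ceiling:
print(round(max_read_ops(8_000, 0.27)))   # ~10959
```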

In the second example, we’ll focus on a subset of Figure 1.8. Here, we have an application server interacting with a data services layer:

Figure 1.10: Application server connects to storage through data services

Initially, we assume the bottleneck is somewhere in the data services layer or below, but when we look at CPU and I/O utilization, we see idle CPU time, and the storage IOPS and bandwidth are not being reached. So, we assume the limiting factor is further upstream. Through measurements on the application server, we see it has no idle CPU time, so it is the bottleneck. It is not generating enough requests to fully load the data services or cloud storage layers.

As a general model of a service in a system, requests come in and are executed by a number of threads within the service. Some threads can interact via shared resources; for example, in a MongoDB server, a query being run by one thread could use space in the storage engine cache, slowing other threads.

Figure 1.11: Service threads handle requests

In this example, the application server is not sending enough requests that can be dispatched to be run in multiple threads within the data services layer. The general performance of a system with this kind of model is as follows:

Figure 1.12: Performance improves as load and threads grow until the system is overloaded

As the load and number of active threads increase, throughput (operations per second) also increases until the system reaches a point where the threads start to contend for shared resources. Beyond this point, further increasing the load and number of threads can cause performance to decline. Most systems have this characteristic. Peak performance tends to be a factor of the following:

  • The amount of work being performed for each request
  • The ratio of threads to CPU cores
  • The contention for shared resources

The best way to understand this load/performance graph for your application or workload is to run stress/load experiments before you put your product into production.
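One way to model this curve (a modeling choice of ours, not a MongoDB feature) is the Universal Scalability Law, where contention and coherency costs cause throughput to peak and then decline:

```python
def usl_throughput(n: float, alpha: float, beta: float) -> float:
    """Relative throughput at concurrency n under the Universal Scalability Law.

    alpha models contention for shared resources (queueing);
    beta models coherency costs (cross-thread coordination).
    """
    return n / (1.0 + alpha * (n - 1.0) + beta * n * (n - 1.0))

# Illustrative parameters: throughput rises, peaks near 32 threads here,
# then falls as more threads only add contention and coherency overhead.
for threads in (1, 8, 32, 128, 512):
    print(threads, round(usl_throughput(threads, alpha=0.03, beta=0.0005), 1))
```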

To resolve this issue, we add a second application server; overall performance improves, but doesn’t reach our performance goal.

We re-analyze the load of the components and now see that the CPU on the server running the mongod process is fully utilized. Performance goes from the Underutilized part of the graph to the Overloaded part. We now see that the MongoDB server is the bottleneck.

Now that we have doubled the load from the application servers, we can reduce it in a finer-grained way by reducing the number of concurrent operations or threads running in the server. You could manage this from your application servers, but another way to do so is to lower the maxPoolSize limit in your database driver or framework/object-relational mapper (ORM). This will reduce the maximum number of connections to the database, which will reduce the maximum number of threads that can be active from your application server at a time. It brings the mongod performance back to the Peak part of the graph in Figure 1.12. With these two changes, we have now reached our performance goal.

This is another example of what seems like a counterintuitive change. Reducing something can improve performance.

An incremental process for optimization

In general, you can follow a seven-step process with a five-step loop. When we’ve seen people struggle with performance tuning and scaling, it is because they start in the middle, skip steps, or don’t know where they are in the process:

Figure 1.13: An incremental process for optimization and tuning

Let’s go through these steps in detail:

  1. Determine the goal/requirement: There’s always room for improvement somewhere in a complex system. Without an agreed goal at the start of the process, performance tuning can become a never-ending task, and the goal can become a moving target as the market or requirements change. Get agreement on the performance requirements from project sponsors and stakeholders before starting the project. A performance requirement should be SMART, which stands for specific, measurable, achievable, relevant, and time-bound. “Make it as fast as possible” is not very specific, achievable, or time-bound. “1M queries per second on a 1 GHz 4-core CPU with 3K IOPS” is not achievable. The requirement should also be relevant to your end user.

This is also a good time to think about the limits of scalability of the system. Presuming data can’t increase unbounded forever, how will data expire or be archived over time? What are the performance targets when data is being archived?

  2. Gather data: Start by gathering data. The working set is the size of the frequently used data and indexes. Does the working set fit in memory? We’ll see examples of this calculation later in the book. In general, be aware of the costs of a particular operation. For example, if your application has an administrator feature that runs an occasional query over a large dataset, it’s likely that the data won’t be in any of the many caches. Worst case, the database will have to retrieve each document from a different disk location, resulting in I/O operations. Your system will have some limit on I/O bandwidth, and this is shared among all the queries running at the same time.
  3. Generate hypotheses: Next, generate hypotheses, or possible explanations for the current bottleneck. Typically, there will be more than one possible hypothesis. It’s a good idea to get a group of people together to brainstorm. Each person will have different levels of experience in different parts of the system. However, everyone will have pet theories based on their experience. Sometimes discussions can get bogged down in one area. At first, focus on quantity; try to get people to suggest possible hypotheses without judgment or analysis.
  4. Prioritize hypotheses: Then, walk through each hypothesis and try to prioritize them. Note any evidence or data that might support a hypothesis. Again, it’s important to time-box the discussion and not spend too much time on any one hypothesis.
  5. Validate #1: Now that we have a #1 hypothesis or a prime suspect, we need to identify a test that could prove or disprove the hypothesis. For example, if we think a specific part of the system could be CPU-bound, try increasing the size of the server it’s running on.
  6. Make a change: Next, make a change that should improve the situation and prove the hypothesis. For example, you could increase the number of IOPS or reduce or increase the number of threads in your application. Ideally, you should make one change at a time.
  7. Measure effect: Finally, measure the change in performance. Hopefully, it’s an improvement. If the overall performance requirement is met, you can release and celebrate. If not, you’ll need to undo the changes that didn’t work, then iterate and look at the next hypothesis and bottleneck.
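The working-set question in the gather-data step can be sketched as simple arithmetic; the figures below are illustrative, and estimating the “hot” fraction of your data is the hard part:

```python
def working_set_fits(hot_data_gb: float, index_gb: float, cache_gb: float) -> bool:
    """Rough check: do frequently used documents plus indexes fit in cache?"""
    return hot_data_gb + index_gb <= cache_gb

# Example: 40 GB of hot documents and 12 GB of indexes against a 64 GB cache.
print(working_set_fits(40, 12, 64))   # True: reads are mostly served from memory
print(working_set_fits(60, 12, 64))   # False: expect disk I/O on many reads
```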

Optimization is not a one-time task. It is a structured, iterative process. By following the steps in order and avoiding the temptation to skip ahead, you can prevent common pitfalls and achieve measurable improvements in system performance.

Summary

This chapter started by describing the characteristics of systems in general and how they can behave in surprising ways when you change them. We then looked at software systems and performance principles in general before looking at data services and MongoDB at a high level. We listed some of the potential bottlenecks or limiting factors before looking at two specific examples, and described a process that can be used when tuning a system for performance.

We also explored MongoDB’s architecture, which is designed with performance in mind. Its document model bridges the gap between rigid relational databases and simple key-value stores, providing flexibility without sacrificing power. We looked at components such as the storage engine, the query optimizer, and the replication mechanism.

By consolidating what would traditionally require multiple specialized systems, MongoDB helps organizations reduce operational complexity while still meeting their diverse data needs. The approach isn’t about sacrificing power for simplicity; it’s about making powerful capabilities more accessible through thoughtful design.

In the next chapter, we’ll dive deeper into MongoDB’s architecture by exploring how it approaches schema design and how you can design for optimal performance.


  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.

Modal Close icon
Modal Close icon