
Friday, 13 March 2026

The Tail at Scale


 When engineers talk about latency, they almost always talk about average latency. P50. The median experience. It's a comfortable metric — it responds to optimization, it's easy to visualize, and it makes dashboards look good. The trouble is that in any non-trivial distributed system, average latency is nearly irrelevant to whether your system actually feels fast.

A 2013 paper from Google, The Tail at Scale, reframed the entire conversation. The insight was simple: in a system where a single user request fans out to hundreds of backend machines, the response time is determined not by the average machine but by the slowest machine in the fan-out. The 99th percentile is not a corner case. It is, structurally, the common case for any sufficiently large request graph.

This is the founding observation behind Google's latency engineering philosophy — and it reshapes how you think about almost every architectural decision in a shared environment.

"In a large enough system, the tail is the average. The stragglers are not noise — they are the signal."

Why Shared Environments Are Inherently Hostile to Latency

A shared environment — whether it's a multi-tenant cluster, a distributed storage layer, or a cloud runtime shared across teams — introduces a class of latency that has nothing to do with your code. It comes from contention: for CPU time, for memory bandwidth, for network queues, for disk I/O. These are resources that your workload competes for with processes it has no visibility into and no control over.

The result is what the paper calls "variability amplification." Even a single machine exhibiting transient slowness — a GC pause, a cache eviction storm, a background compaction job — introduces latency that propagates through the system in ways that are entirely disproportionate to the duration of the original event. A 50ms hiccup on a single shard becomes a 200ms tail for every request that happened to touch that shard during that window.



This is the fundamental problem. And the conventional response — "profile and optimize the slow path" — doesn't work, because the slowness is not in the code. It's in the environment. You cannot optimize away a garbage collector running on a machine you don't control, or a noisy neighbor saturating the memory bus two NUMA nodes away.


Hedged Requests

The first and most important pattern the paper describes is the hedged request. The idea is counterintuitive enough that it's worth stating plainly: rather than waiting for a slow server to respond, you send the same request to a second server after a short delay — and take whichever response arrives first.

The delay is critical. You don't want to double your load by default. Instead, you observe your system's typical P95 response time and use that as the hedge threshold. If a request hasn't completed within that window, you issue an identical request to a different replica and race them. The moment either one responds, you cancel the other.
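A minimal sketch in Java of what the race looks like on the client side, assuming a hypothetical callReplica RPC helper and a fixed hedge delay — not the paper's implementation, just the shape of the pattern:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HedgedClient {
    private static final long HEDGE_DELAY_MS = 50;   // roughly the observed P95
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    // Placeholder for the real RPC to a specific replica.
    static CompletableFuture<String> callReplica(int replica, String request) {
        return CompletableFuture.supplyAsync(() -> "response from replica " + replica);
    }

    static CompletableFuture<String> hedgedCall(String request) {
        CompletableFuture<String> primary = callReplica(1, request);
        CompletableFuture<String> hedge = new CompletableFuture<>();

        // Only issue the second request if the primary has not finished within the delay.
        TIMER.schedule(() -> {
            if (!primary.isDone()) {
                callReplica(2, request).whenComplete((value, error) -> {
                    if (error == null) hedge.complete(value);
                    else hedge.completeExceptionally(error);
                });
            }
        }, HEDGE_DELAY_MS, TimeUnit.MILLISECONDS);

        // Whichever response arrives first wins; a fuller version would also cancel the loser.
        return primary.applyToEither(hedge, response -> response);
    }
}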



The practical effect is remarkable. Measurements at Google showed that hedging could reduce 99.9th percentile latency by an order of magnitude while increasing load by only a few percent — because most requests don't trigger the hedge at all, and those that do are precisely the ones stuck on slow replicas.

Hedged requests trade a small amount of extra load for a large reduction in tail latency. Load amplification is bounded by the fraction of requests that exceed your hedge threshold — which, by definition, is a small minority if you set the threshold near P95.

What makes this pattern powerful in shared environments specifically is that it sidesteps the cause of slowness entirely. You don't need to know why Replica 1 is slow. You don't need to detect it, alert on it, or drain it. You just race around it.

Tied Requests and Cancellation

Hedged requests have a subtle problem: if both replicas are fast, you've wasted work on both. The tied-request pattern refines this by introducing coordination between the two requests. When you issue a hedge, you attach a "cancellation token" that the replicas share. Whichever replica starts processing the request first notifies the other to cancel, and proceeds alone.

This is particularly valuable when requests are expensive to process — when the work itself consumes significant CPU or I/O on the backend. Instead of duplicating work silently, tied requests minimize wasted computation by communicating intent across the request boundary.

The implementation requires some infrastructure: replicas need to be aware of each other's state for a given request, which typically means either a shared coordination layer or an out-of-band cancellation channel. In Google's architecture, this was handled via internal RPC cancellation propagation. In most systems, you can approximate it with request-scoped context cancellation — the Go context.Context model being a modern analogue of this idea.
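In a real deployment the cancellation signal travels between servers out of band; purely to illustrate the control flow, here is a collapsed, in-process Java sketch where the "token" is a shared atomic flag that the first replica to start work claims. This is an illustration of the decision logic only, not of how state would actually be shared across machines:

import java.util.concurrent.atomic.AtomicBoolean;

public class TiedRequest {
    // Shared token: the first replica to begin processing claims it; the other gives up.
    static final class ClaimToken {
        private final AtomicBoolean claimed = new AtomicBoolean(false);
        boolean tryClaim() { return claimed.compareAndSet(false, true); }
    }

    // Hypothetical server-side handler: do the expensive work only if this replica wins the claim.
    static String handle(ClaimToken token, String request) {
        if (!token.tryClaim()) {
            return null;               // the other replica got there first; skip the work
        }
        return expensiveWork(request);
    }

    static String expensiveWork(String request) { return "result for " + request; }
}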


Micro-Partitioning and Fine-Grained Load Balancing

Both hedging and tying are reactive: they respond to latency after it has occurred. A complementary, proactive strategy is micro-partitioning — dividing work into far more partitions than you have machines, so that load imbalance between logical partitions can be corrected by reassigning partitions rather than migrating state.

The intuition is straightforward. If you have 100 machines and partition your keyspace into 100 shards, a hot key on one shard means one machine is overloaded and there's nowhere to move it without a full reshard. If instead you have 10,000 virtual partitions distributed across 100 machines, a hot partition can be migrated to a less-loaded machine in seconds, with minimal disruption.
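A sketch of the bookkeeping this implies — a routing table with far more virtual partitions than machines, where moving a hot partition is a single table update (the counts and names here are illustrative):

public class MicroPartitioning {
    static final int VIRTUAL_PARTITIONS = 10_000; // far more partitions than machines
    static final int MACHINES = 100;

    // partitionOwner[p] = machine currently serving virtual partition p.
    private final int[] partitionOwner = new int[VIRTUAL_PARTITIONS];

    MicroPartitioning() {
        // Initial assignment: spread partitions round-robin across machines.
        for (int p = 0; p < VIRTUAL_PARTITIONS; p++) partitionOwner[p] = p % MACHINES;
    }

    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), VIRTUAL_PARTITIONS);
    }

    int machineFor(String key) {
        return partitionOwner[partitionFor(key)];
    }

    // Rebalancing a hot partition is one table update, not a full reshard.
    void move(int partition, int newMachine) {
        partitionOwner[partition] = newMachine;
    }
}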



This is less a trick and more a structural principle: partition granularity determines your ability to respond to imbalance. Google's Bigtable uses this extensively — tablet splits are designed to be cheap precisely so that hot tablets can be redistributed across tablet servers without downtime.

Good Citizens and Background Throttling

In any shared environment, there are two classes of work: foreground requests with latency SLOs that users directly feel, and background work — compaction, replication sync, index rebuilds, garbage collection — that has no user-visible deadline but consumes the same physical resources. The conflict between these two classes is one of the most consistent sources of latency spikes in production systems.

The solution is conceptually simple: background tasks must be "good citizens." They should yield CPU and I/O to foreground work when demand is detected. In practice, this means implementing throttle mechanisms that observe system load indicators — request queue depth, disk I/O wait, CPU steal — and automatically back off when those indicators cross a threshold.
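As a rough illustration of the "good citizen" loop — a background job that checks a load signal (here a hypothetical foreground queue depth) before each unit of work and steps aside when the system is busy; the threshold and sleep values are placeholders:

public class BackgroundThrottle {
    // Hypothetical load signal, e.g. the current foreground request queue depth.
    interface LoadSignal { int foregroundQueueDepth(); }

    private static final int BACKOFF_THRESHOLD = 32;   // queue depth above which we yield
    private static final long BACKOFF_MS = 100;        // how long to step aside

    static void runBackgroundJob(LoadSignal load, Runnable unitOfWork, int totalUnits)
            throws InterruptedException {
        for (int i = 0; i < totalUnits; i++) {
            // Be a good citizen: if foreground work is queuing up, back off before
            // doing the next unit of background work.
            while (load.foregroundQueueDepth() > BACKOFF_THRESHOLD) {
                Thread.sleep(BACKOFF_MS);
            }
            unitOfWork.run();
        }
    }
}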

Google's approach to this problem includes priority queues in their RPC layer, where foreground traffic can preempt background work mid-execution. Bigtable's compaction scheduler monitors foreground request rates and adjusts compaction aggressiveness in real time. The principle is that background jobs should "earn" their CPU time during slack periods, not consume it as a fixed entitlement.

What's important here is that this isn't optional in a shared environment — it's a contract. If your background jobs don't throttle, you are imposing your latency cost on every other workload sharing your infrastructure. In large organizations, this becomes a coordination problem: the team running the nightly reindex doesn't know which other team's latency SLO they're violating at 2am.

Selective Replication of Hot Data

The patterns above all treat slowness as something to route around or absorb. This final pattern takes a different approach: eliminate the bottleneck entirely for the data that matters most.

In most systems, data access follows a power-law distribution. A small number of keys — a viral post, a high-traffic configuration value, a globally shared counter — account for a disproportionate fraction of reads. These hot items are precisely the ones most likely to create queuing delays, cache evictions, and server-level saturation.

The solution is selective, on-the-fly replication of hot items. Rather than replicating everything uniformly, the system detects hot keys — through access frequency monitoring or explicit client hints — and creates additional in-memory replicas across multiple servers. Reads are then distributed across those replicas, reducing per-server load for the items that need it most.
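A sketch of the read-side routing this implies, with a hypothetical access counter and a fixed number of extra replicas for hot keys; how the extra replicas are actually created and kept fresh is out of scope here:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;

public class HotKeyRouter {
    private static final long HOT_THRESHOLD = 10_000;  // accesses before a key counts as hot
    private static final int EXTRA_REPLICAS = 4;       // replicas added for hot keys

    private final ConcurrentHashMap<String, LongAdder> accessCounts = new ConcurrentHashMap<>();
    private final int totalServers;

    HotKeyRouter(int totalServers) { this.totalServers = totalServers; }

    int serverForRead(String key) {
        LongAdder counter = accessCounts.computeIfAbsent(key, k -> new LongAdder());
        counter.increment();

        int home = Math.floorMod(key.hashCode(), totalServers);
        if (counter.sum() < HOT_THRESHOLD) {
            return home;  // cold key: read from its home server
        }
        // Hot key: spread reads across the home server plus a few extra replicas.
        int offset = ThreadLocalRandom.current().nextInt(EXTRA_REPLICAS + 1);
        return (home + offset) % totalServers;
    }
}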



This pattern is now standard in systems like Memcached (Facebook's lease mechanism was a direct response to this problem), Redis Cluster, and modern distributed caches. The underlying principle — don't treat all data as equally hot, and adapt replication depth to observed access patterns — generalizes far beyond caching.


Latency-Aware Load Balancing

Round-robin load balancing assumes that all backend servers are equivalent and equally available. In a shared environment, this assumption fails constantly. A server experiencing memory pressure or CPU saturation will accept requests at the same rate as a healthy one — queuing them invisibly while the client believes load is distributed evenly. The result is that round-robin actively routes traffic into latency holes it cannot see.

Latency-aware load balancing corrects this by making routing decisions based on observed response times rather than theoretical capacity. The client maintains a rolling measurement of each backend's recent latency and biases requests toward the faster ones. The simplest version is the "power of two choices" algorithm: rather than picking randomly from all backends, pick two at random and route to whichever has the lower current latency. The probabilistic gain is disproportionate to the cost — two random samples are enough to avoid the worst servers most of the time.
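The selection step itself is only a few lines. A sketch, assuming some per-backend rolling latency estimate (an EWMA, say) is maintained elsewhere from observed responses:

import java.util.concurrent.ThreadLocalRandom;

public class PowerOfTwoChoices {
    // recentLatencyMs[i] is a rolling estimate of backend i's recent latency,
    // updated elsewhere from observed responses.
    private final double[] recentLatencyMs;

    PowerOfTwoChoices(double[] recentLatencyMs) { this.recentLatencyMs = recentLatencyMs; }

    int pickBackend() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int a = rnd.nextInt(recentLatencyMs.length);
        int b = rnd.nextInt(recentLatencyMs.length);
        // Two random samples; route to whichever currently looks faster.
        return recentLatencyMs[a] <= recentLatencyMs[b] ? a : b;
    }
}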



The elegance of this approach is that it requires no central coordinator and no global view of server health. Each client maintains its own local latency measurements independently. The collective effect of many clients doing this converges on a system-wide load distribution that naturally isolates slow servers — without any explicit health-checking infrastructure.

Google's gRPC and Envoy proxy both implement variants of this. Netflix's Ribbon client-side load balancer added latency-based weighting as a core feature after observing that round-robin was systematically directing traffic into degraded nodes during partial cluster failures.

Request Deadline Propagation

Every distributed request has an implicit budget: the maximum time the user is willing to wait before the response becomes useless. A search result that arrives after the user has navigated away is not a slow success — it is a waste of resources that could have been spent on a fresher request. Yet most systems treat their internal RPC calls as if they exist outside of time, with no awareness of how much of the outer deadline has already been consumed.

Deadline propagation makes the remaining time budget explicit and transmits it across every service boundary. When a frontend handler receives a request with a 200ms SLO and spends 30ms doing authentication, the downstream RPC it issues should carry a deadline of 170ms — not an unconstrained call that could block for seconds. Each hop in the call graph receives a shrinking time window, and each service is expected to abandon work and return an error rather than continuing once that window closes.

Deadline propagation transforms timeout from a per-hop configuration into a system-wide invariant. Instead of each service having its own independently configured timeout — which can add up to far more than the user's actual patience — the deadline flows through the entire call graph as a shared, decrementing constraint.

Without deadline propagation, a slow backend continues burning CPU on a request whose answer will never be seen. The frontend has already returned an error to the user, but the downstream services don't know this — they keep working, consuming resources that could serve other requests. With deadline propagation, a cancelled frontend request immediately cancels the entire downstream tree. The work stops the moment it becomes irrelevant.
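A minimal sketch of a deadline object that shrinks as it is passed down the call graph; the 200ms/30ms figures mirror the example above, and backend.call is an illustrative name, not a real API:

public class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) { this.deadlineNanos = deadlineNanos; }

    static Deadline afterMillis(long millis) {
        return new Deadline(System.nanoTime() + millis * 1_000_000);
    }

    long remainingMillis() {
        return Math.max(0, (deadlineNanos - System.nanoTime()) / 1_000_000);
    }

    boolean expired() { return remainingMillis() == 0; }
}

// A handler with a 200ms SLO spends some time on auth, then passes the shrinking
// budget downstream instead of an unconstrained timeout:
//
//   Deadline d = Deadline.afterMillis(200);
//   authenticate(request);                       // consumes, say, 30ms
//   if (d.expired()) return error("deadline exceeded");
//   backend.call(request, d.remainingMillis());  // downstream gets ~170ms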



Go's context.Context is the most widely adopted implementation of this idea in modern systems. Passing a context with a deadline through every function call is the idiomatic Go way of expressing exactly this contract. The Dapper tracing system and gRPC's deadline mechanism implement the same principle at the RPC layer.


Probabilistic Early Completion

There is a class of read-heavy workload where the question "what is the correct answer?" is less important than "what is a good enough answer, returned quickly?" Search ranking, recommendation feeds, autocomplete suggestions, approximate analytics — in each of these, the value of the response degrades gradually with quality, not catastrophically. A slightly stale recommendation list is almost as useful as a fresh one. A search result that includes 98% of relevant documents is indistinguishable from one with 100%, from the user's perspective.

Probabilistic early completion exploits this tolerance by allowing a request to return as soon as it has gathered "enough" signal, rather than waiting for every shard to respond. The coordinator tracks how many responses have arrived and, once a statistically sufficient fraction of shards have replied, returns the aggregated result rather than waiting for the stragglers. The remaining responses, when they eventually arrive, are discarded.

The fraction required is a tunable parameter that encodes the application's quality-vs-latency tradeoff. Setting it at 90% means the request finishes when 9 of 10 shards have responded — the one slow shard no longer determines the outcome. The quality loss is bounded by the fraction omitted, and in practice for approximate workloads the loss is negligible while the latency gain is substantial.
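A sketch of a coordinator that returns once a configurable fraction of shard responses has arrived; the 90% figure and the shape of shardCalls are illustrative:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class EarlyCompletion {
    // Return once ENOUGH_FRACTION of shard responses have arrived; stragglers are ignored.
    private static final double ENOUGH_FRACTION = 0.9;

    static <T> List<T> gather(List<CompletableFuture<T>> shardCalls,
                              long maxWaitMillis) throws InterruptedException {
        int needed = (int) Math.ceil(shardCalls.size() * ENOUGH_FRACTION);
        CountDownLatch enough = new CountDownLatch(needed);
        List<T> results = new CopyOnWriteArrayList<>();

        for (CompletableFuture<T> call : shardCalls) {
            call.thenAccept(result -> {
                results.add(result);
                enough.countDown();
            });
        }
        // Wait for "enough" shards (or the hard timeout), then return what we have.
        enough.await(maxWaitMillis, TimeUnit.MILLISECONDS);
        return List.copyOf(results);
    }
}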

Probabilistic early completion only makes sense when partial results are semantically valid. It is appropriate for search, recommendations, aggregated metrics, and autocomplete. It is inappropriate for financial transactions, inventory updates, authentication checks, or any computation where partial data produces incorrect rather than merely approximate output.

Overload Admission Control

All of the patterns discussed so far are concerned with how an individual request navigates a slow or overloaded system. This one operates at a different level: preventing the system from accepting more work than it can complete within latency bounds in the first place.

The counterintuitive observation is that queueing is not a buffer — it is a latency amplifier. When a service is operating at capacity and accepts additional requests into a queue, those requests do not get served "slightly later." They get served much later, because every subsequent request must wait behind everything already in the queue. The 99th percentile latency of a system at 95% utilization can be ten times worse than the same system at 80% utilization, even though the throughput difference is modest.



Admission control accepts this reality and acts on it. Rather than allowing the queue to grow unbounded during traffic spikes, the system measures current utilization — active request count, queue depth, recent latency percentiles — and explicitly rejects incoming requests when those indicators cross a threshold. The rejected requests receive an immediate error rather than a delayed one. From the client's perspective, a fast rejection is often preferable to a timeout: it can retry against a different backend, fail fast, or serve from cache, rather than hanging indefinitely.
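A minimal admission-control gate, with an illustrative in-flight limit — requests that cannot be admitted get an immediate, cheap rejection instead of joining a queue:

import java.util.concurrent.atomic.AtomicInteger;

public class AdmissionControl {
    private static final int MAX_IN_FLIGHT = 200;  // beyond this, queueing only adds latency
    private final AtomicInteger inFlight = new AtomicInteger();

    // Returns false to signal an immediate rejection instead of a slow timeout.
    boolean tryAdmit() {
        while (true) {
            int current = inFlight.get();
            if (current >= MAX_IN_FLIGHT) return false;
            if (inFlight.compareAndSet(current, current + 1)) return true;
        }
    }

    void release() { inFlight.decrementAndGet(); }
}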

Google's internal systems use a technique called "client-side throttling" where the client itself participates in admission control: it tracks its own recent reject rate and probabilistically drops requests before sending them, reducing load on an already-stressed backend without requiring the backend to process and reject each request individually. Netflix's Concurrency Limits library implements a similar adaptive mechanism based on TCP congestion control algorithms — treating the request queue like a network pipe and backing off as soon as it detects queuing delay increasing.


Latency SLO Budgeting Across Teams

The patterns above are all technical. This last one is organizational, but its absence makes every technical pattern less effective.

In a large engineering organization, a single user-facing latency SLO — say, P99 < 300ms — is actually a composite of dozens of internal service SLOs. The frontend has a budget. The auth service has a budget. The ranking service has a budget. The storage layer has a budget. When these budgets are implicit, undocumented, or uncoordinated, teams make local decisions that are individually reasonable but collectively catastrophic. The auth team tightens its internal retry logic, adding 20ms to every call. The indexing team adds a synchronous cache warm-up step. Neither change violates any documented contract, and neither team knows what the other did. The cumulative effect is a P99 regression that shows up in the frontend SLO and takes weeks to attribute.

Latency budgeting makes these implicit contracts explicit. Each service in a call graph is assigned a latency budget — its maximum allowed contribution to the end-to-end P99 — derived from the top-level SLO. Changes that affect that budget require coordination across the services that share the call path. The budget is measured, reported, and treated as a first-class engineering constraint, like memory or CPU quota.

This is less a distributed systems pattern and more a systems-thinking pattern. Latency is a shared resource in the same way that bandwidth or storage is a shared resource. The only difference is that it is invisible until the moment it fails, at which point attribution is painful and slow. Making latency budgets explicit — even approximately — transforms latency from an emergent surprise into a managed constraint.

Embracing Stochastic Reality

What's important about these patterns, taken together, is what they have in common. None of them attempt to eliminate variability from the system. None of them assume that the environment can be made deterministic, or that every machine can be made equally fast, or that background noise can be suppressed. They all start from the premise that variability is irreducible — that shared environments will always produce straggler events — and design around it rather than against it.

"You cannot engineer your way to a deterministic distributed system. You can only engineer your way to one that degrades gracefully in the face of guaranteed non-determinism."

The old engineering intuition was that good infrastructure means predictable infrastructure — every component behaves the same way every time. The new intuition is that good infrastructure means resilient infrastructure — every request completes within acceptable bounds, regardless of what any individual component is doing.

Each pattern acknowledges that something will go wrong and designs so that something going wrong in one place cannot become everything going wrong everywhere.

The question worth sitting with, for any system you're currently building, is not "what is our average latency?" It is: "when something goes wrong on one machine, where does that pain go?" If the answer is "it propagates to every user touching that partition," you have a tail latency problem, and the patterns above are where the solution starts.

Further reading: The ideas in this post are drawn from The Tail at Scale — Communications of the ACM, February 2013. Worth reading in full if any of this resonated.

Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic: it lets you do so many things very efficiently.
In this post I will share some real-world examples where bit operations can be used to gain good performance.

Bitwise operation bootcamp
The bit operators include:
 - AND (&)
 - OR (|)
 - NOT (~)
 - XOR (^)
 - Shifts (<<, >>)

Wikipedia has a good high-level overview of Bitwise_operation. While preparing for this post I wrote learning tests, available in the learningtest github project. Learning tests are a good way to explore anything before you start a deep dive; I plan to write a detailed post on Learning Tests later.

In these examples I will use the bit tricks below as building blocks for solving more complex problems.
  • countBits - count the number of 1 bits in a binary value
  • bitParity - compute the parity (odd or even count of 1 bits), usable as a check bit
  • set/clear/toggle - manipulate a single bit
  • pow2 - find the next power of 2 and use it as a mask

Code for these functions is available @ Bits.java on github and the unit tests are available @ BitsTest.java.
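The actual implementations live in the Bits.java linked above; a minimal sketch of the building blocks could look like this (Integer.bitCount is the JDK's population-count helper):

public final class BitsSketch {
    // Count the number of 1 bits (population count).
    static int countBits(int n) { return Integer.bitCount(n); }

    // Parity: true if the number of 1 bits is odd.
    static boolean bitParity(int n) { return (Integer.bitCount(n) & 1) == 1; }

    // Manipulating a single bit.
    static int set(int n, int pos)    { return n | (1 << pos); }
    static int clear(int n, int pos)  { return n & ~(1 << pos); }
    static int toggle(int n, int pos) { return n ^ (1 << pos); }
    static boolean isSet(int n, int pos) { return (n & (1 << pos)) != 0; }

    // Next power of 2 >= n, usable as a mask base (n must be positive).
    static int nextPow2(int n) { return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1; }
}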

Let's look at some real-world problems now.

Customer daily active tracking
 E-commerce companies keep important metrics like which days a customer was active or did some business. These metrics become very important for building models that improve customer engagement. This type of metric is also useful for fraud or risk related use cases.
Investment banks also use such metrics on stocks/currencies for building trading models etc.

Using simple bit manipulation tricks, 30 days of data can be packed into only 4 bytes, so storing a whole year of info requires only 48 bytes.

Code snippet
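The original snippet is not reproduced here; a minimal sketch of the idea — one bit per day of the month, packed into a single int — might look like this:

public class ActivityTracker {
    // One int holds a month of activity: bit d is set if the customer was active on day d.
    private int monthActivity;

    void markActive(int dayOfMonth) {            // dayOfMonth: 0..29
        monthActivity |= (1 << dayOfMonth);
    }

    boolean wasActive(int dayOfMonth) {
        return (monthActivity & (1 << dayOfMonth)) != 0;
    }

    int activeDays() {
        return Integer.bitCount(monthActivity);  // how many days in the month were active
    }
}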


Apart from compact storage, this pattern has good data locality because the whole thing can be read by the processor using a single load operation.

Transmission errors
This is another area where bit manipulation shines. Say you are building block management software for distributed storage, or some file transfer service; one of the things required for such a service is to make sure the transfer was done properly and no data was lost during transmission. This can be done using a bit parity (odd or even) technique, which involves keeping the number of '1' bits odd or even.


Another way to do this type of verification is Hamming_distance; below is a sketch of hamming distance for integer values.
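A sketch of both checks (the full code is in the linked repo); the parity bit and the hamming distance both reduce to Integer.bitCount:

public class Transmission {
    // Even-parity check bit: 1 if the payload has an odd number of 1 bits,
    // so that payload + parity always has an even count.
    static int parityBit(int payload) {
        return Integer.bitCount(payload) & 1;
    }

    // Hamming distance between two integers: number of bit positions that differ.
    static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }
}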



A very useful way to keep data integrity with no extra overhead.
Locks
Let's get into concurrency now. Locks are generally not good for performance, but sometimes we have to use them. Many lock implementations are very heavyweight and also hard to share between programs. In this example we will build a memory efficient lock: 32 locks can be managed using a single integer.

Code snippet
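The snippet from the post is not reproduced here; a sketch of the idea — one AtomicInteger holding 32 one-bit locks, claimed and released with CAS — could look like this:

import java.util.concurrent.atomic.AtomicInteger;

public class BitLocks {
    // One AtomicInteger manages 32 independent locks, one per bit.
    private final AtomicInteger locks = new AtomicInteger();

    boolean tryLock(int index) {                        // index: 0..31
        int mask = 1 << index;
        while (true) {
            int current = locks.get();
            if ((current & mask) != 0) return false;    // already held
            if (locks.compareAndSet(current, current | mask)) return true;
        }
    }

    void unlock(int index) {
        int mask = ~(1 << index);
        while (true) {
            int current = locks.get();
            if (locks.compareAndSet(current, current & mask)) return;
        }
    }
}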

This example uses the single-bit-setting trick along with AtomicInteger to make the code thread safe.
It is a very lightweight lock. Since this example is related to concurrency, it will have some issues due to false sharing; it is possible to address that using some of the techniques mentioned in the scalable-counters-for-multi-core post.

Fault tolerant disk
Let's get into some serious stuff. Assume we have 2 disks and we want to keep a copy of the data so that we can restore it in case one of the disks fails. The naive way of doing this is to keep a backup copy of every disk, so if you have 1 TB then an additional 1 TB is required. Cloud providers like Amazon will be very happy if you use such an approach.
Just by using the XOR (^) operator we can keep the backup for a pair of disks on a single disk — a 50% gain.
That's a 50% saving on storage expense.

Code snippet testing restore logic.
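A sketch of the XOR parity and restore logic (the full version is in the RaidDisk.java linked below):

public class XorBackup {
    // The parity block is simply the XOR of the two data blocks.
    static byte[] parity(byte[] disk1, byte[] disk2) {
        byte[] p = new byte[disk1.length];
        for (int i = 0; i < p.length; i++) p[i] = (byte) (disk1[i] ^ disk2[i]);
        return p;
    }

    // If one disk is lost, XOR the surviving disk with the parity block to restore it.
    static byte[] restore(byte[] survivingDisk, byte[] parity) {
        byte[] restored = new byte[survivingDisk.length];
        for (int i = 0; i < restored.length; i++) {
            restored[i] = (byte) (survivingDisk[i] ^ parity[i]);
        }
        return restored;
    }
}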

Disk code is available @ RaidDisk.java

Ring buffer
The ring buffer is a very popular data structure for async processing and for buffering events before writing to a slow device. A ring buffer is a bounded buffer, which helps in having a zero allocation buffer in the critical execution path — a very good fit for low latency programming.
One of the common operations is finding a slot in the buffer for write/read, and it is usually done with the mod (%) operator. Mod or divide is not good for performance because it stalls execution: the CPU has only 1 or 2 ports for processing divides but many ports for bitwise operations.

In this example we will use a bitwise operator to find the mod, which is only possible if the divisor is a power of 2. I think this is one trick that everyone should know.

x & (n-1)

If n is a power of 2, then 'x & (n-1)' can be used to find x mod n in a single instruction. This is so popular that it is used in many places — the JDK HashMap also uses it to find the slot in the map.
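A sketch of a ring buffer slot calculation using the mask instead of %, assuming the capacity is a power of 2 (wrap/overwrite handling is omitted):

public class RingBuffer {
    private final Object[] buffer;
    private final int mask;
    private long writeSequence;

    RingBuffer(int capacityPowerOf2) {
        // Capacity must be a power of 2 so that (sequence & mask) == sequence % capacity.
        buffer = new Object[capacityPowerOf2];
        mask = capacityPowerOf2 - 1;
    }

    void put(Object value) {
        int slot = (int) (writeSequence++ & mask);  // bitwise AND instead of %
        buffer[slot] = value;
    }
}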



Conclusion
I have just shared, at a very high level, what is possible with simple bit manipulation techniques.
Bit manipulation enables many innovative ways of solving problems. It is always good to have extra tools in the programmer's kit, and many of these things are timeless and applicable to every programming language.

All the code used in this post is available @ the bits repo.

Thursday, 25 August 2016

Lazy evaluation

Recently I was writing a log4j appender and wanted to use a logger in it to log some diagnostic details during custom appender creation, but log4j initialization completes only after the appender instances are created, so messages logged during this phase are ignored.

I felt the need for lazy initialization in the custom appender and started to look at options.

In this blog I will share the things that I tried.

One of the things that came to my mind was the Singleton approach, but it is now a known fact that singletons cause problems with testing and make it impossible to extend, so that way of mixing concurrency & object construction is not great.

In case a singleton is required, it is better to use a Dependency Injection framework rather than spoiling your application code.

Let's get back to lazy initialization/evaluation.

Some programming languages like Scala/Swift have built-in support for lazy, so no custom code is required, but in the Java space we still have to write thread safe code to get it right.

Let's look at some options we have in Java and what type of performance we get.

- Brute force using Synchronized
This is the simplest and least efficient one; Scala uses this approach. The Scala-style one is available @ ScalaLazy.java



- Double lock
This is a little more complex to write and gives good performance.
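The snippets from the post live in the linked repo; a minimal double-checked locking sketch, with the volatile field that makes it correct, looks like this:

public class DoubleLockLazy<T> {
    private volatile T value;                 // volatile is essential for safe publication
    private final java.util.function.Supplier<T> factory;

    public DoubleLockLazy(java.util.function.Supplier<T> factory) {
        this.factory = factory;
    }

    public T get() {
        T result = value;
        if (result == null) {                 // first check without the lock (fast path)
            synchronized (this) {
                result = value;
                if (result == null) {         // second check under the lock
                    result = factory.get();
                    value = result;
                }
            }
        }
        return result;
    }
}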

- Using Future task
This approach is simple to write and gives good performance.
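And a sketch of the FutureTask variant — FutureTask.run() executes the supplied callable at most once, so every caller sees the same value:

import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;

public class FutureTaskLazy<T> {
    private final FutureTask<T> task;

    public FutureTaskLazy(Callable<T> factory) {
        this.task = new FutureTask<>(factory);
    }

    public T get() throws Exception {
        task.run();        // runs the factory at most once; later calls are no-ops
        return task.get(); // every caller gets the same computed value
    }
}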


The double lock approach gives the best performance and the brute force one is the worst. I did a quick benchmark of 1 million calls using different numbers of threads.


Single lock performance is very bad. Let's look at the numbers with the single lock removed, to see how Double Lock & Future Task performed.


These benchmarks were done very quickly, but detailed benchmark numbers should be close.


Code for this blog post is available @ github






Thursday, 23 July 2015

Efficiency with Algorithms

Recently I had a look at the excellent talk Efficiency with Algorithms, Performance with Data Structures; this talk has some really good content on performance.

In this blog I will share some ideas from the above talk and a few things that I have learned.

Pre processing 
This is a very common trick: it is a trade-off between the processing required when the actual request comes in vs the time taken to pre-compute something. Some good examples follow.

IndexOf
This is a very common string operation; almost all applications need this algorithm. Java uses a brute force algorithm to solve it, which can become a bottleneck very quickly, so it is better to use the Knuth–Morris–Pratt algorithm, which pre-computes a table to get better performance.

Search
Sequential search vs binary search is a trade-off between search time and pre-processing (i.e. sorting) time; it starts to pay off once the number of comparisons required goes over O(log n).

Binary search is at the heart of many algorithms; it reduces expensive search operations.

Many string permutation search problems are solved using it.

HashMap in Java has a nice optimization for when many keys with the same hashcode are added: to ameliorate the impact, when keys are Comparable, the class may use the comparison order among keys to help break ties.

Indexes
Lookup is an expensive operation and binary search can't answer all types of queries, so you should build a specialized index for fast search. Some of the options are key/value, prefix/suffix, bitmap index, inverted index etc.

Pre Allocate
This is a classic one: if you know how big your data structure will be, it is better to pre-allocate it rather than keep expanding it multiple times and incur the cost of allocation & copy.

Array based data structures double their capacity every time re-allocation is required; this is done to amortize allocation cost, so they do big and less frequent allocations. In the Java world, ArrayList, HashMap etc. are good examples of that.

Many design patterns like Object Pooling/Flyweight etc. are based on this.

Reduce Expensive/Duplicate operations
Each algorithm has some expensive operations, and to make it efficient those expensive operations should be reduced. Some common examples:

HashCode
Hash code computation is an expensive operation, so it should not be computed multiple times for a given key. One way is to compute it once and pass it to the other functions so recomputation is not required, or to cache it — e.g. the String class in Java caches its hash code to save time.

Modulus / Divide
This is one of the more expensive arithmetic operations; a bitwise operation can be used to do the same thing at lower cost.
If the data structure is circular, e.g. a circular array, and you want to put a value in a free slot, a modulus operation is required. If the capacity of the array is a power of 2, then a bitwise operation (i.e. index & (capacity-1)) can be used to get the modulus value, and this gives a significant performance gain at the cost of extra memory. This technique is used by HashMap in Java.

Similarly, the right shift (>>) operation can be used for divide to get better performance, but nowadays compilers are smart, so you get this one for free — no need to write it yourself.

Reduce copy overhead
Array based data structures amortize re-sizing cost by doubling capacity, which is good, but the overhead of copying comes along with it. Another approach is a chain of arrays: you still allocate one new big chunk, but you don't have to copy the old values — you just add the new block of memory to the chain of allocated blocks.
GS Collections has CompositeFastList, which is built using this approach.

Batching
Expensive operations like writes to file/network/database etc. should be batched to get the best performance.

Co-locate data
Keeping all required data together gives very good performance given how processors work, but this is one thing that is broken in many applications.
Mike Acton talks about this in detail in his Data-Oriented Design and C++ talk.

Column based storage is very good for analytic/aggregation use cases because most of the time a single column's data across all rows is required to answer an analytic/aggregation request.

Linear probing hash tables are also a good example of data co-location.

Bit packing is another option to keep required data very close, but it should be used with extra caution because it can make code complex.

Unnecessary Work
Some algorithms suffer from this problem; the most common are recursive algorithms like factorial or Fibonacci, both of which repeat work on the same sub-problems. This can be fixed with the memoization technique.
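A sketch of memoized Fibonacci — each value is computed once and cached, turning exponential duplicate work into linear work:

import java.util.HashMap;
import java.util.Map;

public class Memoized {
    private static final Map<Integer, Long> cache = new HashMap<>();

    // Naive recursive Fibonacci recomputes the same sub-problems exponentially many times;
    // caching each result makes every value computed exactly once.
    static long fib(int n) {
        if (n <= 1) return n;
        Long cached = cache.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        cache.put(n, result);
        return result;
    }
}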


Simplicity & readability of code should not be lost in the pursuit of efficiency, because the next person has to maintain it :-)

Thursday, 7 May 2015

Experiment with String Split

Splitting a string based on some token is a very common operation in applications. In this blog I will share some options that are commonly used and the types of overhead involved in them.

String.split
This is the most common approach and it looks harmless until you look at the code!
First it creates an ArrayList for storing the values and then this ArrayList is converted to a String array.

This function produces too much garbage and it provides only one way to extract the values.

String Tokenizer 
This is much better than String.split because it does not need an intermediate buffer like an ArrayList/String array to hold the values, but it still creates a String object for each token, which adds to garbage.

One of the good things about StringTokenizer is that it does not force the caller to use a fixed data structure, so the caller is free to decide what to do with the values — it is just like a streaming operation.

String.split vs StringTokenizer
Let's look at some performance numbers. In this test I take the sample strings below:

String[] values = {
        "this,is,simple,test",
        "lets,see,how,it,works",
        "this,is,very,simple,test"};

Each line is split 10 million times

So StringTokenizer is definitely the winner in this case, because it produces less garbage than String.split — but it still produces String objects.

It is possible to avoid creating those String objects too, by using a recycled CharSequence, which is just like a String but gives lots of flexibility.

Let's look at another implementation using the recycled CharSequence approach.
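The implementation from the post's repo is not reproduced here; a sketch of the idea — a single reusable CharSequence window over the original string, so no per-token String objects are created — could look like this:

// A mutable window over the original string: no per-token String objects are created.
final class RecyclableToken implements CharSequence {
    private CharSequence source;
    private int start, end;

    void reset(CharSequence source, int start, int end) {
        this.source = source; this.start = start; this.end = end;
    }
    @Override public int length() { return end - start; }
    @Override public char charAt(int index) { return source.charAt(start + index); }
    @Override public CharSequence subSequence(int s, int e) {
        return source.subSequence(start + s, start + e);
    }
    @Override public String toString() { return source.subSequence(start, end).toString(); }
}

final class RecyclingSplitter {
    private final RecyclableToken token = new RecyclableToken();

    // Calls the consumer once per token, reusing the same token object each time.
    void split(String line, char separator, java.util.function.Consumer<CharSequence> consumer) {
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == separator) {
                token.reset(line, start, i);
                consumer.accept(token);
                start = i + 1;
            }
        }
    }
}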


It is a very simple technique to avoid intermediate String objects. Let's look at the performance numbers for this approach.

The recycled CharSequence shines in this test: it is around 3x faster than String.split and 2x faster than StringTokenizer.

Code available @ github

Saturday, 20 July 2013

ArrayList Using Memory Mapped File

Introduction
In-memory computing is picking up due to affordable hardware; most of the data is kept in RAM to meet latency and throughput goals. But keeping data in RAM creates Garbage Collector overhead, especially if you don't pre-allocate.
So effectively we need a garbage less/free approach to avoid GC hiccups.

Garbage free/less data structure
There are a couple of options to achieve it:
 - Object Pool
The object pool pattern is a very good solution; I wrote about it in the Lock Less Object Pool blog.

- Off Heap Objects
The JVM has very good support for creating off-heap objects. You can get rid of GC pauses if you take this highway, but the highway has its own risks!

- Memory Mapped File
This is a mix of heap & off-heap — the best of both worlds.

A memory mapped file lets you map part of the data into memory, and that memory is managed by the OS, so it creates very little memory overhead in the JVM process that maps the file.
This can help in managing data in a garbage free way, and you can have the JVM manage a large amount of data.
Memory mapped files can also be used to develop IPC; I wrote about that in the power-of-java-memorymapped-file blog.

In this blog I will create an ArrayList that is backed by a memory mapped file; this array list can store millions of objects with almost no GC overhead. It sounds crazy, but it is possible.

Let's get into action
In this test I use an Instrument object that has the attributes below:
 - int id
 - double price

So each object is 12 bytes.
This new array list holds 10 million objects and I will try to measure writer/read performance.
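The real implementation is linked at the end of the post; a minimal sketch of the idea — fixed 12-byte records (int id + double price) written straight into a MappedByteBuffer, so the JVM heap holds almost nothing and the OS manages the pages — looks roughly like this:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedInstrumentList implements AutoCloseable {
    private static final int RECORD_SIZE = Integer.BYTES + Double.BYTES; // 12 bytes

    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;
    private int size;

    public MappedInstrumentList(String path, int capacity) throws Exception {
        file = new RandomAccessFile(path, "rw");
        buffer = file.getChannel()
                     .map(FileChannel.MapMode.READ_WRITE, 0, (long) capacity * RECORD_SIZE);
    }

    public void add(int id, double price) {
        int offset = size++ * RECORD_SIZE;
        buffer.putInt(offset, id);
        buffer.putDouble(offset + Integer.BYTES, price);
    }

    public int idAt(int index)       { return buffer.getInt(index * RECORD_SIZE); }
    public double priceAt(int index) { return buffer.getDouble(index * RECORD_SIZE + Integer.BYTES); }

    @Override public void close() throws Exception { file.close(); }
}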

Writer Performance



X Axis - No Of Reading
Y Axis - Time taken to add 10 Million in Ms

Adding 10 million elements takes around 70 ms — pretty fast.

Writer Throughput
Let's look at another aspect of performance: throughput.


X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions

Writer throughput is very impressive; it ranges between 138 million and 142 million per second.

Reader Performance

X Axis - No Of Reading
Y Axis - Time taken to read 10 Million in Ms

It takes around 44 ms to read 10 million entries — very fast. With this kind of performance you can definitely challenge a database.

 Reader Throughput

X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions


Wow, the throughput is great: it is 220+ million per second.

It looks very promising, with 138 million/sec writer throughput & 220+ million/sec reader throughput.

Comparison With Array List
Let's compare the performance of BigArrayList with ArrayList.

Writer Throughput - BigArrayList Vs ArrayList




Throughput of BigArrayList is almost constant at around 138 million/sec; ArrayList starts at 50 million and drops under 5 million.

ArrayList has a lot of hiccups, due to:
 - Array Allocation
 - Array Copy
 - Garbage Collection overhead

BigArrayList is the winner in this case; it is about 7x faster than ArrayList.

Reader Throughput - BigArrayList Vs ArrayList

ArrayList performs better than BigArrayList in this case; it is around 1x faster.

BigArrayList is slower in this case because:
 - It has to keep mapping the file into memory as more data is requested
 - There is the cost of un-marshalling

Reader throughput for BigArrayList is still 220+ million/sec; that is very fast, and only a few applications need to process messages faster than that.
So for most use cases this should work.

Reader performance can be improved further by using the techniques below:
 - Read messages in batches from the mapped stream
 - Pre-fetch messages using an index, like a CPU does

By making the above changes we could improve performance by a few million, but I think for most cases the current performance is pretty good.

Conclusion
Memory mapped files are an interesting area to research; they can solve many performance problems.
Java is now being used for developing trading applications, and GC is one question you have to answer from day one. You need to find a way to keep GC happy, and memory mapped files are one thing GC will love.

The code used for this blog is available @ GitHub; I ran the tests with 2gb of memory.
The code doesn't handle some edge cases, but it is good enough to prove the point that a memory mapped file can be a winner in many cases.

Sunday, 19 May 2013

Lock Less Java Object Pool

It has been a while since I wrote anything; I have been busy with my new job, which involves some interesting work in performance tuning. One of the challenges is to reduce object creation during the critical part of the application.

Garbage Collection hiccups have been a main pain point in Java for some time, although Java has improved over time with its GC algorithms. Azul is the market leader in developing pauseless GC, but Azul JVMs are not free as in speech!

Creating too many temporary/garbage objects doesn't work too well because it creates work for GC and that has an effect on latency; too much garbage also doesn't work well with multi core systems because it causes cache pollution.

So how should we fix this ?

Garbage less coding
This is only possible if you know how many objects you need upfront and pre-allocate them, but in reality that is very difficult to determine — and in case you still manage to do it, then you have to worry about other issues:

  • You might not have enough memory to hold all the objects you need
  • You have to handle concurrency
So what is the solution for the above problems?
The Object Pool design pattern can address both of the above issues: it lets you specify the number of objects you need in the pool and it handles concurrent requests for objects.

Object pools have been the basis of many applications that have low latency requirements; a flavor of the object pool is the Flyweight design pattern.

Both of the above patterns help us avoid object creation. That is great: GC work is reduced and in theory our application performance should improve. But in practice it doesn't happen that way, because the Object Pool/Flyweight has to handle concurrency, and whatever advantage you get by avoiding object creation is lost to concurrency overhead.

What are the most common ways to handle concurrency?
An object pool is a typical producer/consumer problem and it can be solved by using the following techniques:

Synchronized - This was the only way to handle concurrency before JDK 1.5; Apache has written a wonderful object pool API based on synchronized.

Locks - Java has had excellent support for concurrent programming since JDK 1.5; there has been some work using locks to develop object pools, e.g. furious-objectpool.

Lock Free - I could not find any implementation built using a fully lock free technique, but furious-objectpool uses a mix of an ArrayBlocking queue & a ConcurrentLinked queue.

Let's measure performance
In this test I created a pool of 1 million objects, which is accessed by different pool implementations; objects are taken from the pool and returned back to it.

The test first starts with 1 thread, and then the number of threads is increased to measure how the different pool implementations perform under contention.


X Axis - No Of Threads
Y Axis - Time in Ms - Lower time is better

This test includes the pool from Apache, the Furious Pool & an ArrayBlocking based pool.

The Apache one is the worst, and as the number of threads increases performance degrades further; the reason is that the Apache pool is based on heavy use of "synchronized".

The other two, Furious & the ArrayBlocking based pool, perform better, but both of them also slow down as contention increases.

The ArrayBlocking queue based pool takes around 1000 ms for 1 million items when 12 threads are trying to access the pool; the Furious pool, which internally uses an ArrayBlocking queue, takes around 1975 ms for the same thing.

I have to do some more detailed investigation to find out why Furious takes double the time, because it is also based on an ArrayBlocking queue.

The performance of the ArrayBlocking queue is decent, but it is a lock based approach; what type of performance could we get from a lock free pool?

Lock free pool
Implementing a lock free pool is not impossible, but it is a bit difficult because you have to handle multiple producers & consumers.

I will implement a hybrid pool, which uses a lock on the producer side & a non blocking technique on the consumer side.

Let's have a look at some numbers.

I performed the same test with the new implementation (FastPool) and it is almost 30% faster than the ArrayBlocking queue based pool.

A 30% improvement is not bad; it can definitely help in meeting latency goals.

What makes Fast Pool fast!
I used a couple of techniques to make it fast:
  • Producers are lock based - multiple producers are managed using locks; this is the same as ArrayBlocking queue, so nothing special here.

  • Immediate publication of released items - the element is published before the lock is released, using a cheap memory barrier. This gives some gain.

  • Consumers are non blocking - CAS is used to achieve this; consumers are never blocked by producers. ArrayBlocking queue blocks consumers because it uses the same lock for producer & consumer.

  • Thread Local to maintain value locality - a ThreadLocal is used to re-acquire the last value that the thread used, which reduces contention to a great extent.
If you are interested in having a look at the code, it is available @ FastObjectPool.java. A much-simplified sketch of the hybrid idea is included below.
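FastObjectPool.java is the real implementation; as an illustration of only the hybrid idea (a lock on the release path, lock-free CAS claims on the acquire path — the thread-local and memory-barrier tricks above are omitted, and the linear scan is deliberately naive), a sketch could look like this:

import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

public class HybridPool<T> {
    private final AtomicReferenceArray<T> slots;
    private final ReentrantLock releaseLock = new ReentrantLock();

    public HybridPool(java.util.List<T> initialObjects) {
        slots = new AtomicReferenceArray<>(initialObjects.size());
        for (int i = 0; i < initialObjects.size(); i++) slots.set(i, initialObjects.get(i));
    }

    // Consumer side: lock free. Scan for a slot holding an object and claim it with CAS.
    public T acquire() {
        for (int i = 0; i < slots.length(); i++) {
            T candidate = slots.get(i);
            if (candidate != null && slots.compareAndSet(i, candidate, null)) {
                return candidate;
            }
        }
        return null; // pool exhausted
    }

    // Producer side: lock based (like ArrayBlocking queue), one releaser at a time.
    public void release(T object) {
        releaseLock.lock();
        try {
            for (int i = 0; i < slots.length(); i++) {
                if (slots.get(i) == null && slots.compareAndSet(i, null, object)) {
                    return;
                }
            }
        } finally {
            releaseLock.unlock();
        }
    }
}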