Showing posts with label Lockfree. Show all posts
Showing posts with label Lockfree. Show all posts

Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is continuation of Scalable Counter post.

One of the reader shared result from his system. He ran test on XEON intel processor with 16 core and total time taken for each type of counter is almost same, although Atomic counter has CAS failure & other type does't has any CAS failure but it does't make any difference in execution time.
Very strange result , needs further investigation.

Another reader pointed out that it could be due to false sharing, so it worth to take that into account and i created another class that handles FALSE SHARING

Time take by different counter
 Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

PaddedAtomicCounter is new type of counter that i have added to test & it it outperforms all other counters.
It is using cache line padding to avoid false sharing.
Cacheline is of 64 byte on most of the today processor. PaddedCounter it Integer based counter so it is adding 16 slot per counter, by using this techniques we avoid cache pollution & as result we see 16X times gain as compared to AtomicCounter, without padding the gain was 5X and with cache line padding gain jumps to 16X.
With cacheline padding you need some extra space, so it is trade off of memory for speed, you can choose what you want!

CAS failure rate
Lets look at the CAS failure for different counter & what it means for performance.


Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

PaddedAtomic has some CAS failure as compared to other counters , but it does't make any difference in execution time of the counter.
CAS failure is not the only factory that can determined execution time, false sharing make significant contribution to it, so this gives good explanation of behavior seen in XEON processor

Conclusion
To get better performance you have to take care of few things

 -  Contention  - There are many techniques to avoid it , this blogs shows one of them

 -    False Sharing - You have to avoid false sharing to get best out of processor, padding is required for that. Some of the JDK classes are using padding are ThreadLocalRandom , now we have @Contended annotation from java to achive same thing, it is being used in ForkAndJoinPool

Code is available @ github

   

Tuesday, 10 September 2013

Scalable Counters For Multi Core

Counters are required everywhere , for e.g. to find key KPI of application, load on application, total number of request served, some KPI for finding throughput of application & many more.

With all these requirement complexity of concurrency is also added & that makes this problem interesting.

How to implement concurrent counter

 - Synchronized - This was the only option before JDK 1.5, since now we are waiting for JDK8 release , so definitely this is not an option.

- Lock based  - You should never attempt this for counter , it will perform very badly

- Wait Free - Java does't have support for Fetch-and-add, so bit difficult to implement it.

- Lock free - With very good support of Compare-and-swap, this looks good option to use.

How does Compare-and-Swap based counter performs

I used AtomicInteger for this test and this counter is incremented for 1 Million time by each thread & to increase the contention number of threads are increased gradually.

Test Machine Details
OS : Windows 8
JDK : 1.7.0.25
CPU : Intel i7-3632QM , 8 Core
RAM : 8 GB











Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

As number of threads are increased, time taken to increment counter increases and it is due to contention.
For CAS based counter , it is CAS failure that causes slowdown.

Is this best performance that we can get ? no definitely their are better solution to implement concurrent counter, lets have look at them.

Alternate Concurrent Counter
Lets look at some of the solution to implement counter that handles contention in better way

- Core Based Counter - Maintain counter for each logical core, so that way you will have less contention. Only issue you have this type of counter is that if number of threads are more than logical core then you will start noticing contention.

- Thread Based Counter - Maintain counters for total number of threads that will be using system. This works well when number of threads are more than number of logical cores.


Lets test it

Time taken by different types of counter










Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

Concurrent Counter performs much better than Atomic based counter, for 16 threads it is around 5X times better, that is huge difference!

CAS Failure Rate

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

Due to contention, Atomic based counter see lot of failure and it goes up exponential as i add more threads & other counters performs pretty well.

Observation
  Multi core machines becoming easily available & we have to change the way we handle concurrency, traditional way of doing concurrency is not going to scale in today's time when having 24 or 48 core server is very common.

 - To reduce the contention you have to use multiple counters and then aggregate them later

 - Core based counter works well if number of threads will be less or same as number of cores

 - Thread based counter is good when number of threads are much more than available core

 - Key to reduce contention is identify counter to which thread will write,i have used simple approach based on thread id, but much better approach are available, look at ThreadLocalRandom of JDK 8 for some ideas.

 - Thread based approach is used in LongAdder of JDK8, which creates many slots to reduce contention.

Code for all the counters used in this test are available @ Github

Tuesday, 25 December 2012

Are you using correct logging framework ?

Logging framework is heart of every application, it helps in troubleshooting production issue,knowing how your application is being use,what are bottle neck in application and may more.

Using right logging framework is key, Wikipedia has list of famous one in java world.

So logging has lot of benefit  but it also brings lot of overhead ,so you do trade-off for benefit that you get. It is interesting to measure cost of overhead and we all know it has big overhead because it is I/O bound, so we come with best strategy on what to log, how much to log, which one is best framework etc.

One problem with current logging framework is that it , not much work has been done on performance improvement, especially if you talk about taking advantage of multi core architecture. Application try to do more work to take advantage of multiple cores and that may result into more..... log message and since log framework are not their yet, so they don't scale.

Cost Of Log Message

Lets measure time spent in logging. 
I wrote java program that sums the number of element in array and it is done in parallel. Array is divided in chunk and each task sums that chunk and log message before & after it process chunk. I have taken simple computation problem because we want to measure cost of logging.

For bench marking purpose , i do calculation with/without logging to check how much time we spend on logging and numbers are crazy.

Java JDK 1.4 logging framework(java.util.logging) is used  for benchmark, i only took java one because all other like log4j , apache logging , SLFJ etc are almost same when they log message to file.

Details of Machine
OS : Windows 7, 64 bit
Processor : Intel i5, 2.40 GHz
No Of Cores : 4 



Numbers are really crazy, once you add logging, performance drops by 6X to 10X times, it is not worth it to use logging.

More time is spent in logging than doing real work, but fact of life is we still need logging and if it is not there then it will be nightmare to troubleshoot production issue.

Why it is so slow!
To find out why is super slow, we have to dive into the code, so lets have look at default handler(FileHandler) that java logging framework provides.

FileHandler has synchronized function that perform core logging and we know what type of performance degradation you can get in multi threaded env, so that is number 1 reason for slowness.

public synchronized void publish(LogRecord record) {
        if (!isLoggable(record)) {
            return;
        }
        super.publish(record);
        flush();
       .........
      ..........
      ...........
    }


Other reason for slowness is that I/O operation is performed for each and every call of log message, there is no buffering done, not sure why it is done like that may be to keep it simple and all these simplicity adds to performance cost. 

All of these techniques result in lot of contention and we see big degrade in performance by harm-less looking code.

 What can we do now ?
So we have to find some alternate framework that does't make our application 10X times slow!
 We have to do couple of quick things.
 - Get rid of synchronized
 - Try to add some I/O buffering
 - Use Async for logging

So i did add these stuff and created simple logging util and measured the cost of same.

So the light green one is the logger that has some of improvement and it is amazing to see that we are back to same performance and it is not adding any significant overhead, it is almost as there is no logging in code.

So by just switching logging framework we can get up to 6X to 10X benefit.

What gives up these performance improvement
  - ConcurrentLinkedQueue is used to store all the log message, ConcurrentLinkedQueue is lock free queue but it is not bounded. Bouded queue can be used to reduce risk where producer is faster than consumer.

   - Buffering is added for I/O operation, these reduces I/O operation

   - Prealocated bytebuffer is used to keep all the message that needs to writen to file.

  - Lock based wait strategy is used when there is no message in queue, better waiting strategy can be used based on requirement, but for logs i think lock based are fine.

Conclusion 
So time has come to re-think logging framework used, just using by using correct logging framework we can get big performance boost. To make application more responsive, we have to have to look for alternate. This is very important in low latency & high throughput space.

Example that i used is very basic, it does't have all the features, but can be easily extended to add those stuff. 


Link To Code
Code available @ github

Friday, 30 November 2012

Lock Free bounded queue

Lock free bounded queue

JDK 1.5 was first step towards adding serious concurrency support to java, it changed  the way you write concurrent program and in other version 1.6, 1.7 etc more concurrency support. 

One of the important building block for concurrent programming is Queue, java has lot of them ArrayBlockingQueue, ArrayDeque, ConcurrentLinkedQueue, DelayQueue, LinkedBlockingDeque, LinkedBlockingQueue, PriorityBlockingQueue, PriorityQueue, SynchronousQueue.
I am sure you will find one Queue that will fit your need!

Locks are used in most of the queue for managing concurrent access to data structure because of obvious reason of simplicity to use, lock based strategy makes it difficult to develop low latency system, it could result in priority inversion, context switching, deadlock etc 
Alternate way is using Non Blocking Algorithm , which is bit difficult to develop, one of the non blocking algorithm is lock free algorithm

Java added support for look free via java.util.concurrent.atomic package, so now time for some action with lock free api.

Java 1.5+ has lock free queue ConcurrentLinkedQueue, which is very fast but it is unbounded and such type queue can be very risky to use real application because most of the time queue is not balanced, it is either full or empty and if it is unbounded then system can run out of resource.

I will implement lock free bounded queue and measure it performance against lock based bounded queue(ArrayBlocking), lock free unbounded queue(ConcurrentLinkedQueue). I chose these queue because they are fastest queue available in JDK.

Number Game




  Test Envinorment Details
  OS : Windows 7, 64 bit
  Processor : Intel (R) Core(TM) i7-2820QM CPU @ 2.30 GHz
  Processor Arch : Sandy Bridge , 4 physical cores/8 threads

  For this test 10 Million message was added to queue using single producer and consumer number are in range of 1 to 5, non blocking call(offer/poll) call is used to produce/consume message from queue, performance is best when there is 1 consumer and as number of consumer increase performance degrades, so it is evident that contention is the culprit. 

  Cpu Usage

Cpu usage for LockFreeBoundedQueue





Cpu usage for ArrayBlockingQueue

From CPU usage it is clear that contention is causing drop in cpu usage and with lock free queue CPU usage is in acceptable range.

Test Code of LockFreeBoundedQueue

Link To Code

Above code is just an idea to demonstrate that how powerful non blocking algorithm can be. This code can be enhanced to add more feature like multiple producer, support for blocking call put/take.
Before adding blocking support(put/take) to queue we have to think about blocking strategy, there are couple of option spinning, hybrid spinning etc.