Showing posts with label Low Latency. Show all posts
Showing posts with label Low Latency. Show all posts

Tuesday, 23 June 2020

Bit fiddling every programmer should know

Bit fiddling looks like magic, it allows to do so many things in very efficient way.
In this post i will share some of the real world example where bit operation can be used to gain good performance.

Technology Basics: Bits and Bytes - Business Technology, Gadgets ...
Bit wise operation bootcamp
Bit operator include.
 - AND ( &)
 - OR ( | )
 - Not ( ~)
 - XOR( ^)
 - Shifts ( <<, >>)

Wikipedia has good high level overview of Bitwise_operation. While preparing for this post i wrote learning test and it is available learningtest github project. Learning test is good way to explore anything before you start deep dive. I plan to write detail post on Learning Test later.

In these examples i will be using below bits tricks as building block for solving more complex problem.
  • countBits  - Count number of 1 bits in binary
  • bitParity - Check bit added to binary code
  • set/clear/toggle - Manipulating single bit
  • pow2 - Find next power of 2 and using it as mask.

Code for these function is available @ Bits.java on github and unit test is available @ BitsTest.java

Lets look at some real world problems now.

Customer daily active tracking
 E-commerce company keep important metrics like which days customer was active or did some business. This metrics becomes very important for building models that can be used to improve customer engagement. Such type of metrics is also useful for fraud or risk related usecase.
Investment banks also use such metrics for Stocks/Currency for building trading models etc.

Using simple bit manipulation tricks 30 days of data can be packed in only 4 bytes, so to store whole year of info only 48 bytes are required.

Code snippet


Apart from compact storage this pattern have good data locality because whole thing can be read by processor using single load operation.

Transmission errors
This is another area where bit manipulation shines. Think you are building distributed storage block management software or building some file transfer service,  one of the thing required for such service is to make sure transfer was done properly and no data was lost during transmission. This can be done using bit parity(odd or even) technique, it involves keeping number of '1' bits to odd or even.


Another way to do such type of verification is Hamming_distance. Code snippet for hamming distance for integer values.



Very useful way to keep data integrity with no extra overhead.
Locks
Lets get into concurrency now. Locks are generally not good for performance but some time we have to use it.  Many lock implementation are very heavy weight and also hard to share between programs .In this example we will try to build lock and this will be memory efficient lock, 32 locks can be managed using single Integer.

Code snippet

This example is using single bit setting trick along with AtomicInteger to make this code threadsafe.
This is very lightweight lock. As this example is related to concurrency so this will have some issues due to false sharing and it is possible to address this by using some of the technique mention in scalable-counters-for-multi-core post.

Fault tolerant disk
Lets get into some serious stuff. Assume we have 2 disk and we want to make keep copy of data so that we can restore data incase one of the disk fails, naive way of doing this is to keep backup copy of every disk, so if you have 1 TB then additional 1 TB is required. Cloud provider like Amazon will be very  happy if you use such approach.
Just by using XOR(^) operator we can keep backup for pair of disk on single disk, we get 50% gain.
50% saving on storage expense.

Code snippet testing restore logic.

Disk code is available @ RaidDisk.java

Ring buffer
Ring buffer is very popular data structure when doing async processing , buffering events before writing to slow device. Ring buffer is bounded buffer and that helps in having zero allocation buffer in critical execution path, very good fit for low latency programming.
One of the common operation is finding slot in buffer for write/read and it is done by using Mod(%) operator, mod or divide operator is not good for performance because it stalls execution because CPU has only 1 or 2 ports for processing divide but it has many ports for bit wise operation.

In this example we will use bit wise operator to find mod and it is only possible if mod number is powof2. I think it is one of the trick that everyone should know.

n & (n-1)

If n is power of 2 then 'x & (n-1)' can be used to find mod in single instruction. This is so popular that it is used in many places, JDK hashmap was also using this to find slot in map.



Conclusion
I have just shared at very high level on what is possible with simple bit manipulation techniques.
Bit manipulation enable many innovative ways of solving problem. It is always good to have extra tools in programmer kit and many things are timeless applicable to every programming language.

All the code used in post is available @ bits repo.

Saturday, 17 August 2019

JVM with no garbage collection

JVM community keeps on adding new GC and recently new one was added and it is called Epsilon and is very special one. Epsilon only allocates memory but will not reclaim any memory.

Image result for garbage collection

It might look like what is use of GC that does not perform any garbage collection. This type of Garbage Collector has special use and we will look into some.

Where this shinny GC can be used ?
Performance Testing

If you are developing solution that has tight latency requirement and limited memory budget then this GC can be used to test limit of program.

Memory Pressure Testing
Want to know extract transient memory requirement by your application. I find this useful if you are building some pure In-Memory solution.

Bench marking Algorithm.
Many time we want to test the real performance of new cool algorithm based on our understanding of BIG (O) notion but garbage collector adds noise during testing.

Low Garbage
Many times we do some optimization in algorithm to reduce garbage produced and GC like epsilon helps in scientific verification of optimization.

How to enable epsilon GC

JVM engineers have taken special care that this GC should not enabled by default in production , so to use this GC we have to use below JVM options

-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc


One question that might be coming in your mind what happens when memory is exhausted ? JVM will stop with OutofMemory Error.

Lets look at some code to test GC

How to know if epsilon is used in JVM process?

Java has good management API that allows to query current GC being used, this can also be used to verify what is the default GC in different version of java.

Run above code with below options
-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC VerifyCurrentGC

How does code behave when memory is exhausted. 

I will use below code to show how new GC works.

Running above code with default GC and requesting 5GB allocation causes no issue (java -Xlog:gc -Dmb=5024 MemoryAllocator) and it produces below output

[0.016s][info][gc] Using G1
[0.041s][info][gc] Periodic GC disabled
Start allocation of 5024 MBs
[0.197s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 116M->0M(254M) 3.286ms
[0.197s][info][gc] GC(1) Concurrent Cycle
[0.203s][info][gc] GC(1) Pause Remark 20M->20M(70M) 4.387ms
[0.203s][info][gc] GC(1) Pause Cleanup 22M->22M(70M) 0.043ms
[1.600s][info][gc] GC(397) Concurrent Cycle 6.612ms
[1.601s][info][gc] GC(398) Pause Young (Concurrent Start) (G1 Humongous Allocation) 52M->0M(117M) 1.073ms
[1.601s][info][gc] GC(399) Concurrent Cycle
I was Alive after allocation
[1.606s][info][gc] GC(399) Pause Remark 35M->35M(117M) 0.382ms

[1.607s][info][gc] GC(399) Pause Cleanup 35M->35M(117M) 0.093ms
[1.607s][info][gc] GC(399) Concurrent Cycle 6.062ms

Lets add some memory limit ( java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc -Xmx1g -Dmb=5024 MemoryAllocator)

[0.011s][info][gc] Resizeable heap; starting at 253M, max: 1024M, step: 128M
[0.011s][info][gc] Using TLAB allocation; max: 4096K
[0.011s][info][gc] Elastic TLABs enabled; elasticity: 1.10x
[0.011s][info][gc] Elastic TLABs decay enabled; decay time: 1000ms
[0.011s][info][gc] Using Epsilon
Start allocation of 5024 MBs
[0.147s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 52640K (5.02%) used
[0.171s][info][gc] Heap: 1024M reserved, 253M (24.77%) committed, 103M (10.10%) used
[0.579s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 935M (91.35%) used
[0.605s][info][gc] Heap: 1024M reserved, 1021M (99.77%) committed, 987M (96.43%) used

Terminating due to java.lang.OutOfMemoryError: Java heap space

This particular run caused OOM error and is good confirmation that after 1GB this program will crashed.

Same behavior is true multi thread program also, refer to MultiThreadMemoryAllocator.java for sample.

Unit Tests are available to test features of this special GC.

I think Epsilon will find more use case and adoption in future and this is definitely a good step to increase reach of JVM.

All the code samples are available Github repo

If you like the post then you can follow me on twitter .


Monday, 25 June 2018

Inside Simple Binary Encoding (SBE)


SBE is very fast serialization library which is used in financial industry, in this blog i will go though some of the design choices that are made to make it blazing fast.

Whole purpose of serialization is to encode & decode message and there many options available starting from XML,JSON,Protobufer , Thrift,Avro etc.

XML/JSON are text based encoding/decoding, it is good in most of the case but when latency is important then these text based encoding/decoding become bottleneck.

Protobuffer/Thrift/Avro are binary options and used very widely.

SBE is also binary and was build based on Mechanical sympathy to take advantage of underlying hardware(cpu cache, pre fetcher, access pattern, pipeline instruction etc).

Small history of the CPU & Memory revolution.
Our industry has seen powerful processor from 8 bit, 16 , 32, 64 bit and now normal desktop CPU can execute close to billions of instruction provided programmer is capable to write program to generate that type of load. Memory has also become cheap and it is very easy to get 512 GB server.
Way we program has to change to take advantage all these stuff, data structure & algorithm has to change.


Lets dive inside sbe.

Full stack approach

Most of the system rely on run-time optimization but SBE has taken full stack approach and first level of optimization is done by compiler.


Schema - XML file to define layout & data type of message.
Compiler - Which takes schema as input and generate IR. Lot of magic happen in this layer like using final/constants, optimized code.
Message - Actual message is wrapper over buffer.

Full stack approach allows to do optimization at various level.

No garbage or less garbage
This is very important for low latency system and if it is not taken care then application can't use CPU caches properly and can get into GC pause.

SBE is build around flyweight pattern, it is all about reuse object to reduce memory pressure on JVM.

It has notion of buffer and that can be reused, encoder/decoder can take buffer as input and work on it. Encoder/Decoder does no allocation or very less(i.e in case of String).

SBE recommends to use direct/offheap buffer to take GC completely out of picture, these buffer can be allocated at thread level and can be used for decoding and encoding of message.

Code snippet for buffer usage.

 final ByteBuffer byteBuffer = ByteBuffer.allocateDirect(4096);
 final UnsafeBuffer directBuffer = new UnsafeBuffer(byteBuffer);

tradeEncoder        .tradeId(1)
        .customerId(999)
        .qty(100)
        .symbol("GOOG")
        .tradeType(TradeType.Buy);
   

Cache prefetching

CPU has built in hardware based prefetcher. Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed.
Accessing data from fast CPU cache is many orders of magnitude faster than accessing from main memory.

latency-number-that-you-should-know blog post has details on how fast CPU cache can be.

Prefetching works very well if algorithm is streaming and underlying data used is continuous like array. Array access is very fast because it sequential and predictable

SBE is using array as underlying storage and fields are packed in it.




Data moves in small batches of cache line which is usually 8 bytes, so if application asks for 1 byte it will get 8 byte of data. Since data is packed in array so accessing single byte prefetch array content in advance and it will speed up processing. 

Think of prefetcher as index in database table. Application will get benefit if reads are based on those indexes.

Streaming access
SBE supports all the primitive types and also allows to define custom types with variable size, this allows to have encoder and decoder to be streaming and sequential. This has nice benefit of reading data from cache line and decoder has to know very little metadata about message(i.e offset and size).

This comes with trade off read order must be based on layout order especially if variable types of data is encoded.

For eg Write is doing using below order 
tradeEncoder        .tradeId(1)
        .customerId(999)
        .tradeType(TradeType.Buy)
        .qty(100)
        .symbol("GOOG")
        .exchange("NYSE");


For String attributes(symbol & exchange) read order must be first symbol and then exchange, if application swaps order then it will be reading wrong field, another thing read should be only once for variable length attribute because it is streaming access pattern.

Good things comes at cost!

Unsafe API
Array bound check can add overhead but SBE is using unsafe API and that does not have extra bound check overhead.

Use constants on generated code
When compiler generate code, it pre compute stuff and use constants. One example is field off set is in the generated code, it is not computed.

Code snippet 

     public static int qtyId()
    {
        return 2;
    }

    public static int qtySinceVersion()
    {
        return 0;
    }

    public static int qtyEncodingOffset()
    {
        return 16;
    }

    public static int qtyEncodingLength()
    {
        return 8;
    }

This has trade-off, it is good for performance but not good for flexibility. You can't change order of field and new fields must be added at end.
Another good thing about constants is they are only in generated code they are not in the message to it is very efficient.

Branch free code
Each core has multiple ports to do things parallel and there are few instruction that choke like branches, mod, divide. SBE compiler generates code that is free from these expensive instruction and it has basic pointer bumping math.
Code that is free from expensive instruction is very fast and will take advantage of all the ports of core.
Sample code for java serialization

public void writeFloat(float v) throws IOException {
    if (pos + 4 <= MAX_BLOCK_SIZE) {
        Bits.putFloat(buf, pos, v);        pos += 4;    } else {
        dout.writeFloat(v);    }
}

public void writeLong(long v) throws IOException {
    if (pos + 8 <= MAX_BLOCK_SIZE) {
        Bits.putLong(buf, pos, v);        pos += 8;    } else {
        dout.writeLong(v);    }
}

public void writeDouble(double v) throws IOException {
    if (pos + 8 <= MAX_BLOCK_SIZE) {
        Bits.putDouble(buf, pos, v);        pos += 8;    } else {
        dout.writeDouble(v);    }
}

Sample code for SBE
public TradeEncoder customerId(final long value)
{
    buffer.putLong(offset + 8, value, java.nio.ByteOrder.LITTLE_ENDIAN);    return this;}

public TradeEncoder tradeId(final long value)
{
    buffer.putLong(offset + 0, value, java.nio.ByteOrder.LITTLE_ENDIAN);    return this;}


Some numbers on message size.

Type class marshal.SerializableMarshal -> size 267
Type class marshal.ExternalizableMarshal -> size 75
Type class marshal.SBEMarshall -> size 49

SBE is most compact and very fast, authors of SBE claims it is around 20 to 50X times faster than google proto buffer.

SBE code is available @ simple-binary-encoding
Sample code used in blog is available @ sbeplayground


Saturday, 20 July 2013

ArrayList Using Memory Mapped File

Introduction
In-Memory computing is picking up due to affordable hardware, most of the data is kept in RAM to meet latency and throughput goal, but keeping data in RAM create Garbage Collector overhead especially if you don't pre allocate.
So effectively we need garbage less/free approach to avoid GC hiccups

Garbage free/less data structure
There are couple of option to achieve it
 - Object Pool 
Object pool pattern is very good solution, i wrote about that in Lock Less Object Pool blog

- Off Heap Objects
JVM has very good support for creating off-heap objects. You can get rid of GC pause if you take this highway and highway has its own risk!

-MemoryMapped File
This is mix of Heap & Off Heap, like best of world.

Memory mapped file will allow to map part of the data in memory and that memory will be managed by OS, so it will create very less memory overhead in JVM process that is mapping file.
This can help in managing data in garbage free way and you can have JVM managing large data.
Memory Mapped file can be used to develop IPC, i wrote about that in power-of-java-memorymapped-file blog

In this blog i will create ArrayList that is backed up by MemoryMapped File, this array list can store millions of object and with almost no GC overhead. It sounds crazy but it is possible.

Lets gets in action
In this test i use Instrument object that has below attribute
 - int id
 - double price

So each object is of 12 byte.
This new Array List holds 10 Million Object and i will try to measure writer/read performance

Writer Performance



X Axis - No Of Reading
Y Axis - Time taken to add 10 Million in Ms









Adding 10 Million element is taking around 70 Ms, it is pretty fast.

Writer Throughput
Lets look at another aspect of performance which is throughput


X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions









Writer throughput is very impressive, i ranges between 138 Million to 142 Million

Reader Performance

X Axis - No Of Reading
Y Axis - Time taken to read 10 Million in Ms









It is taking around 44 Ms to read 10 Million entry, very fast. With such type of performance you definitely challenge database.

 Reader Throughput

X Axis - No Of Reading
Y Axis - Throughput /Second , in Millions









Wow Throughput is great it is 220+ million per second

It looks very promising with 138 Million/Sec writer throughput & 220 Million/Sec reader throughput.

Comparison With Array List
Lets compare performance of BigArrayList with ArrayList,

Writer Throughput - BigArrayList Vs ArrayList




 Throughput of BigArrayList is almost constant at around 138 Million/Sec, ArrayList starts with 50 Million and drops under 5 million.

ArrayList has lot of hiccups and it is due to 
 - Array Allocation
 - Array Copy
 - Garbage Collection overhead

BigArrayList is winner in this case, it is 7X times faster than arraylist.

Reader Throughput - BigArrayList Vs ArrayList

ArrayList performs better than BigArrayList, it is around 1X time faster.

BigArrayList is slower in this case because
 - It has to keep mapping file in memory as more data is requested
 - There is cost of un-marshaling

Reader Throughput for BigArrayList is 220+ Million/Sec, it is still very fast and only few application want to process message faster than that.
So for most of the use-case this should work.

Reader performance can be improved by using below techniques 
 - Read message in batch from mapped stream
 - Pre-fetch message by using Index, like what CPU does

By doing above changes we can improve performance by few million, but i think for most of the case current performance is pretty good

Conclusion
Memory mapped file is interesting area to do research, it can solve many performance problem.
Java is now being used for developing trading application and GC is one question that you have to answer from day one, you need to find a way to keep GC happy and MemoryMapped is one thing that GC will love it.

Code used for this blog is available @ GitHub , i ran test with 2gb memory.
Code does't handle some edge case , but good enough to prove the point that that MemoryMapped file can be winner in many case.

Sunday, 9 June 2013

CPU cache access pattern

Things to know about low latency

If you are developing low latency application then you have to remove bottleneck that can cause your latency to increase, some of the bottleneck that comes to my mind

 - Input/Output - I/O has big effect in latency it will cause CPU to stall. SSD disk may be one option, but still you want to keep this out of core processing logic.

 - Network - Same effect as I/O, so you have to reduce Network read/write in your core logic, there are many software and hardware solution to improve this.

- Locks in code - Using non-blocking techniques.

- Keeping data in RAM - I will explore this option in blog.

So we can keep all data in RAM to get best latency possible. RAM is very cheap, so lets put every thing in RAM. In theory you should be able to keep CPU busy if all the data that you need for computing is available in RAM, but in practical it doesn't happen that way because of underlying datastructure used.
CPU stalls for some time due to slow access of RAM.

Experiment
Lets try simple experiment to prove that keeping all data in RAM is just not enough to achieve low latency.

Problem:
25 million instrument object are stored in memory and each object has 2 attribute
 - Instrument Id
 - Price 
Sets of input price is passed as input and all the instrument matching that price will be returned. It is a simple search based on price attribute.

Structure Of Instrument Object
Two types of instrument object is created for this test.

 - As simple java object
Instrument object is simple java class with 2 attribute

 class Instrument
{
public int id;
private int price;
public Instrument(int id, int price) {
super();
this.id = id;
this.price = price;
}

public int getId() {
return id;
}

public int getPrice() {
return price;
}
}

- Column type object
For each attribute array is created, this structure is more CPU cache friendly, we will why is is so later.
Sample ColumnInstrument Object

 class ColumnInstrumentStore
{

private int[] ids;
private int[] prices;

public ArrayInstrumentStore(int size)
{
ids = new int[size];
prices = new int[size];
}


public int getId(int index) {
return ids[index];
}


public int getPrice(int index) {
return prices[index];
}
public void setValue(int index, int id, int price) {
ids[index] = id;
prices[index] = price;
}

}

Lets measure performance
Lets try to measure search cost for 2 types of instrument object,

 X axis - No of reading
 Y axsis - Time taken in Ms

Linear search is performed on 25 million instrument object.
Instrument object defined as column is around 5% to 25% faster as compared to normal java based instrument object, on average it is 15% faster as compared to java based object.

Lets look at another number - Time take for each iteration. Time taken for 1 iteration is very less so i have used nano second to measure that.



 X axis - No of reading
 Y axis - Time taken in NanoSecond

This explains why search on column based objects is fast, time spent per iteration is around 15% less .

Why Column based object is fast
Lets look at the reason why search performs better on Column layout type of object as compared to normal object.

 - CPU Pre-Fetching - Today almost all the processor supports hardware pre-fetch. CPU pre-fetch data that is required by application to reduce memory latency

 - Better use of CPU cache - CPU moves data from main memory to cache via cache lines, in current generation of CPU, cache line is of 64 byte, so if you ask for 1 byte of data, processor will still bring 64 byte of data from ram to CPU cache line. This is called - Spatial locality, definition from wikipedia -

Spatial locality: if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. In this case it is common to attempt to guess the size and shape of the area around the current reference for which it is worthwhile to prepare faster access

This CPU behavior can be used to reduce round trip to ram by co-locating you data.

Lets take Instrument object to understand layout

- Object Layout


- Column Layout


In case of column based instrument, data is stored in integer array, if program request for single value of array , it will get next 15 element of array due to cache line transfer, but in case of normal object it will get only 7 instrument price.
Normal object layout requires double round trip to memory and that is the reason why it is slow.
If you co locate your objects then you can use CPU cache much more efficiently.

Most of the real object are more complex than Instrument object that i have used and you can imagine number of round trip it require to RAM.
You can get more details of memory latency from latency-number-that-you-should-know
Just to put those number in context -

L1 Cache : 0.5 ns
Main Memory  : 100 ns ( 20x times more than L1)


How it performs in multi-threaded world.
In today's time it is impossible to get machine with single core, we live in multi core world!
Each logical core has its own private cache, so if you data structure is cache friendly then it will scale linearly as you add more threads/core to your processing.

Lets look at some number for multi threaded search.

Search latency


Loop Iteration

Both search & iteration per second is improving as more number of threads are added, this will become more effective for NUMA machines because memory access is not uniform.

Just by having CPU cache friendly layout you can improve application performance
Column layout have more benefit, more on next blog.

Tuesday, 25 December 2012

Are you using correct logging framework ?

Logging framework is heart of every application, it helps in troubleshooting production issue,knowing how your application is being use,what are bottle neck in application and may more.

Using right logging framework is key, Wikipedia has list of famous one in java world.

So logging has lot of benefit  but it also brings lot of overhead ,so you do trade-off for benefit that you get. It is interesting to measure cost of overhead and we all know it has big overhead because it is I/O bound, so we come with best strategy on what to log, how much to log, which one is best framework etc.

One problem with current logging framework is that it , not much work has been done on performance improvement, especially if you talk about taking advantage of multi core architecture. Application try to do more work to take advantage of multiple cores and that may result into more..... log message and since log framework are not their yet, so they don't scale.

Cost Of Log Message

Lets measure time spent in logging. 
I wrote java program that sums the number of element in array and it is done in parallel. Array is divided in chunk and each task sums that chunk and log message before & after it process chunk. I have taken simple computation problem because we want to measure cost of logging.

For bench marking purpose , i do calculation with/without logging to check how much time we spend on logging and numbers are crazy.

Java JDK 1.4 logging framework(java.util.logging) is used  for benchmark, i only took java one because all other like log4j , apache logging , SLFJ etc are almost same when they log message to file.

Details of Machine
OS : Windows 7, 64 bit
Processor : Intel i5, 2.40 GHz
No Of Cores : 4 



Numbers are really crazy, once you add logging, performance drops by 6X to 10X times, it is not worth it to use logging.

More time is spent in logging than doing real work, but fact of life is we still need logging and if it is not there then it will be nightmare to troubleshoot production issue.

Why it is so slow!
To find out why is super slow, we have to dive into the code, so lets have look at default handler(FileHandler) that java logging framework provides.

FileHandler has synchronized function that perform core logging and we know what type of performance degradation you can get in multi threaded env, so that is number 1 reason for slowness.

public synchronized void publish(LogRecord record) {
        if (!isLoggable(record)) {
            return;
        }
        super.publish(record);
        flush();
       .........
      ..........
      ...........
    }


Other reason for slowness is that I/O operation is performed for each and every call of log message, there is no buffering done, not sure why it is done like that may be to keep it simple and all these simplicity adds to performance cost. 

All of these techniques result in lot of contention and we see big degrade in performance by harm-less looking code.

 What can we do now ?
So we have to find some alternate framework that does't make our application 10X times slow!
 We have to do couple of quick things.
 - Get rid of synchronized
 - Try to add some I/O buffering
 - Use Async for logging

So i did add these stuff and created simple logging util and measured the cost of same.

So the light green one is the logger that has some of improvement and it is amazing to see that we are back to same performance and it is not adding any significant overhead, it is almost as there is no logging in code.

So by just switching logging framework we can get up to 6X to 10X benefit.

What gives up these performance improvement
  - ConcurrentLinkedQueue is used to store all the log message, ConcurrentLinkedQueue is lock free queue but it is not bounded. Bouded queue can be used to reduce risk where producer is faster than consumer.

   - Buffering is added for I/O operation, these reduces I/O operation

   - Prealocated bytebuffer is used to keep all the message that needs to writen to file.

  - Lock based wait strategy is used when there is no message in queue, better waiting strategy can be used based on requirement, but for logs i think lock based are fine.

Conclusion 
So time has come to re-think logging framework used, just using by using correct logging framework we can get big performance boost. To make application more responsive, we have to have to look for alternate. This is very important in low latency & high throughput space.

Example that i used is very basic, it does't have all the features, but can be easily extended to add those stuff. 


Link To Code
Code available @ github