1202Performance

Top 10 Tips on Getting Observability Right

This was featured as the Capacitas thought of the Week and as I was the author thought I would capture it on my own blog! See the original here: https://www.capacitas.co.uk/insights/top-10-tips-on-getting-cloud-observability-right

After buying a tool, create an ecosystem of people and integrations to get observability that provides value.
Make sure the tool fits your technology stack (ok, this may seem obvious but be careful as the tool needs to work with legacy tech as well as the new and shiny tech).
Spend time configuring the tools to make it readable and relevant to the users i.e., Rename IIS application pool AGKP_04520756 to UKAGKPPortal.
Remove noise from the tool e.g. If volume /dev/u001/fred is always full then configure the tool not to show it in red!
Try to avoid underlying infrastructure alerting, it doesn’t matter what your hardware is doing, it is your users experience that matter. Set alerts at the UX level.
Configure alerts that align with the business. For example, set alerts for critical user transactions such as time to generate a quote or complete payment rather than a generic alert across all user transactions.
Decide who are the consumers of the tool and work with them to make sure they are trained; the tool should be configured for their needs. I am not talking about just turning them into a dashboard. That would be teaching them how to fish!
These tools do love data sources. The more you give them the more likely you can correlate the source of problems i.e., poor performance could be due to waiting for a VM to be scheduled on the hypervisor. You may never know this unless you are monitoring the hypervisor. Of course, the more you monitor, the more you pay!
Don’t be afraid to have a dedicated monitoring team but ensure they are skilled in using the tool and not just configuring the tool. i.e. When production goes down, they are in the thick of it trying to resolve the issue.
Keep an eye on the costs!

Dynatrace Performance Troubleshooting Example

They do say the exception proves the rule. In this example, I do deviate from how I normally troubleshoot a problem. But, it was a bit of an odd problem!

One of the account managers I work with was getting worried. They are using data from Dynatrace to calculate an SLA metric for end user response times. In the recent months the response time for a key transaction had jumped significantly. pushing the value into the red. The SLA reports the 95th percentile. So, I was asked to have a look to see what the problem could be.

Looking at the response time for the transaction the median is very good at 350ms so I switched to the slowest 10% (90th percentile) and that was just over one minute. So, the next step was to look at the waterfall diagram for one of the slow pages. The waterfall is shown in the graphic below:

The waterfall is showing the delay is for the OnLoad processing. Typically the OnLoad processing is the execution of JavaScript after the page is loaded.

Normally, I would then try to recreate the problem myself and try to profile the JavaScript with the browser developer tools but that takes time and I still had some more digging to do in Dynatrace. Next, I wanted to see how this looked for a users session and I noticed something odd. Here, is the session for a user.

There are few things of interest:

(1) The user is just using the problem page/transaction

(2) The page is called every minute. (This suggests that this is automatically called rather than initiated by the user)

(3) Response time is good and then goes bad, and is a pretty consistent value.

I was surprised that a page this slow didn’t already have users complaining about it. So, I decided to see if I knew any of the users and then I could just check this with them. As it happens I recognized one of the users and I gave them a call and the mystery was explained.

The config.aspx page is a status page used by certain users, It transpires that this is accessed from a desktop device within the data center which means users have to remote desktop onto the desktop from their laptops (don’t ask why). We, did some live tests and discovered it occurred when the browser was left open and the remote desktop went into sleep mode! Therefore, when we get slow pages there isn’t a real user at the end waiting for a response.

Quick chat with the account manager and we agreed that page should be excluded from the SLA calculation. Problem sorted!

Performance Monitoring choosing a Sampling Period

I was recently visiting the St Ives Tate Art Gallery (I know I am so cultured/it was raining) and came across an art work called Measuring Niagara with a Teaspoon by Cornelia Parker. The blurb about the work is based on her interest in the absurd task of measuring an enormous object with such a small object. In this case Niagara falls using a tea spoon.

So, as a performance engineer this got me thinking about how we choose the sampling period when collecting performance data. We generally believe the more granular (shortest sampling period) the better as detail is lost when data is sampled over longer periods. For example, the graphic below is for CPU utilisation sampled at different periods and as you can see the odd behavior of the CPU does not become apparent until sampling every second.

However, the shorter the sampling period the more data we collect and have to anaylse. There are often times when we don’t have the luxury of being able to collect data with a really short sampling period. In which case how do we choose the best sample period?

For me, it comes down to what am I looking for in the data. For example, users are complaining about an intermittent problem where for about 15 minutes during the day response time is really slow across all user transactions. The time delay in the incident reporting means I can’t rely on using a circular buffer to store a couple of hours of highly sampled data which could be flushed just after an incident. So, I need to collect day, I would live with a sampling period of about 5 to 10 minutes. This means I would have enough data to correlate changes in resource usage with when the users reported performance issues.

Another example would be for looking for reasons behind slow response times. For example we are looking for reasons the response time has increase to over 6 seconds. In that case the minimum I would accept would be 3 seconds.

You can see in most cases the rule I apply is at least sample at 1/2 the duration of the period of interest. Ideally I would go 1/3 the period of interest. As they say you need 3 data points to draw a curve!

There is possibly some science behind this. There is the Nyquist Theorem which is used in signal processing. It says you need to sample twice as much as the frequency of the analogue signal you are converting. There is a bit of maths to prove the theorem but basically it makes sense to think you need to have at least two data points to detect a noticeable change in the system you are sampling.

How to Manage Complexity (Martin Thompson) Interview

While some people provide the odd cartoon at Christmas! Dave Farley provides something more substantial this festive season. I always like to listen to Martin Thompson on any of his performance work. This interview is a bit more general than purely performance related work, but it is still worth taking the time to listen. Also, I found it very interesting to hear that his consulting engagements are moving away from improving response time to improving efficiency.

To me the take away from this interview are listed below:

(1) Keep things simple

(2) Concentrate on the business logic

(3) Work in small steps, feedback and make small incremental changes

(4) Most performance problems he sees are systemic design flaws

(5) Measure and under stand (model) the system you are trying to improve and compare that model of what should be achieved.

(6) Consider separation of concerns in the design will lead to simpler code (“one class, one thing; one method, one thing”)

(7) If you haven’t heard about it think about Mechanical Sympathy https://dzone.com/articles/mechanical-sympathy

Christmas Cartoon 2021

Response time blip when the system is quiet

Working with a customer using Dynatrace there where some noticeable peaks in response times for a application that wasn’t highly used. As can be seen in the graph below:

The application was web based and hosted on IIS. I had come across problems like this before and suspected it was due to the idle timer setting for an application pool. By default IIS will shutdown worker processes after being idle for 20 minutes. However, for some applications the time to restart a worker process can be noticeable slow.

If you set the idle Time-out to 0 then it will not terminate and you won’t have to suffer the restart overhead. I have made this change several times and not seen any adverse affects. But remember all applications are different!

1202Performance

Christmas Cartoon 2025

Christmas Cartoon 2024

Christmas Cartoon 2023