One bright, sunny business day, while working away at the next sprint's deliverables, you get a panicky call: "The site is getting really slow and clients are yelling at me because they are getting errors and spinning wheels of death." This is never the call one wants, but if one is in the business long enough, it will happen eventually.
There is an expression in software along the lines of "Make it run, make it run right, make it run fast." It's, shall we say, a polite way of stating a certain pragmatism that exists in the field: get it working and out the door, and if we find a problem, we'll come back and address it. This can happen for any number of reasons: maybe the project is behind, maybe there was a high amount of staff turnover, perhaps even a pandemic blew up one's roadmap. Perhaps this was some little function that the client wanted but insisted wouldn't be used much.
We will delve into those topics at a later point (see, I just did it myself!). But when the call comes, and it will at some point, how can you and your team respond and recover as quickly and non-disruptively as possible?
Where can the issue come from?
Web applications, especially enterprise-level applications, come with all manner of connections and dependencies on third-party systems: often cloud-first or cloud-hybrid solutions, various middle-tier integrations with varying degrees of SLAs, and interdependent yet still often independent groups tasked with managing various aspects of the application(s) and infrastructure. Sometimes it's impressive the whole apparatus manages to stay running in the first place.
It's important as a developer to have a solid understanding of ALL parts and pieces of the application. Limiting oneself to only the application code (and I say that with all due respect here) reduces one's understanding, increases mean time to resolution, and ultimately reduces one's value to the organization. At a minimum, anyone above an entry-level developer on a project should become familiar, to some degree, with the following and with who is responsible for each:
- Web Application Code (You are the developer!)
- Any ancillary or dependent systems – Examples include CRM integrations, custom ETL systems, and in-house back-end systems. Keep in mind not only the systems your app depends on, but also the systems that depend on yours.
- Databases – So many performance issues revolve around databases, whether non-optimized databases, sizing issues, poorly built queries or procs, or cases where all of that is right but how the application leverages the database is just plain wrong.
- Servers – Do you know the base configuration of the server infrastructure used?
- Firewalls and networking – One bad routing rule can ruin your day
- Cloud providers – Many mid-to-large organizations are already utilizing one or looking to do so. If you want to stay relevant AND be able to help design systems and diagnose issues, build up this knowledge.
As you can see, there are many possible points of failure. If the site you are responsible for starts grinding to a halt, how can one possibly know where to begin?
Instrumentation and metrics!
"But, but, we don't have those yet!" If you find yourself in that position, do whatever you can to educate yourself and your team, and help drive the conversations necessary to put the process and tooling in place to support it. It can mean the difference between spotting a performance-related issue on the horizon and a full-blown outage, with your site coming to a screeching halt under load every time it comes back up and no good idea which of the points above is at fault. It's not a good position to be in. We'll cover this in much more detail as time goes on.
Did I mention that databases, or how they are leveraged, are the cause of many site performance degradation events?
While over time we will cover many facets of the performance space, today I will share what I have learned (sometimes the hard way) with large web applications and systems. Whenever I start to see system degradation, I interrogate my database(s). If you haven't seen this done before, here is a great write-up on the topic if you are running MS SQL Server: Click Here. I am looking for procedures that are taxing the CPU heavily. Hopefully, you are gathering running metrics so you can compare against previous points in time for trending purposes, yes? As a developer in my youth, if I saw a Windows-based SQL Server showing high CPU load, I would instinctively pull out my code and start looking for the most complicated, join-happy (cross join on XML, anyone?) query I could find, because that must be the offender. No, not necessarily. Let me provide an example of how understanding only your code can harm your recovery efforts:
- Situation: A single MS SQL Server cluster powering a web application is showing steady, full CPU consumption, and the application is showing degradation.
- There is no performance environment, just a staging space to perform QA testing.
- An initial thought is that it must be the recently updated query that joins a few more tables because of a new requirement. A logical thought, as it is a change and something that, if used rather heavily, could cause the increase.
- Running a CPU report as shown in the link above (a sketch of such a query follows this list) reveals three offending queries.
- Query 1 is a very simple query, a single join between two tables, but it has a high execution count and is consuming a very high amount of CPU. Why?
- Query 2 is the complex query that was updated; its impact is noticeable but far less than the other two, and it doesn't explain the rampant CPU consumption across the system.
- Query 3 belongs to a different database for a related system running on the same database server (remember when I said it was important to understand as many facets as possible, not just the code base?). This one has a complicated join, a high execution count, and the greatest impact on CPU consumption.
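If you have not run one of these before, here is a minimal sketch of such a CPU report using SQL Server's dynamic management views; the TOP 10 and the columns chosen are illustrative and not the exact query from the write-up linked above.
-- A minimal sketch of a "top CPU consumers" report built on SQL Server DMVs.
-- total_worker_time is reported in microseconds.
SELECT TOP 10
    DB_NAME(st.dbid)                                   AS database_name,
    qs.execution_count,
    qs.total_worker_time / 1000                        AS total_cpu_ms,
    (qs.total_worker_time / qs.execution_count) / 1000 AS avg_cpu_ms_per_exec,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
              ((CASE qs.statement_end_offset
                    WHEN -1 THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset
                END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
A report like this is what lets you say "Queries 1 and 3 are the problem" instead of guessing from the code alone.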
Given this insight, we ruled out Query 2 as the primary contributor; maybe it can be optimized further with additional review, but it is not the root cause crippling our site, and hence we should not focus on it right now. We need to focus our efforts on Queries 1 and 3.
We have to quickly tear apart and understand the organisms that constitute Query 1 and Query 3, assuming the server is not simply severely undersized at the moment due to events like:
- DevOps could have errantly sized down the server
- Maybe a surge of traffic due to a new promotion
- Something that went viral driving an incredible amount of traffic to your site.
Let’s look at Query 1:
SELECT table1.column_name, table2.column_name
FROM table1
INNER JOIN table2
    ON table1.column_name = table2.column_name;
Well, the join is about as simple as you can make it, so one will not be able to optimize it further. What else can we look at? The primary items I look at in this situation are:
- Is the procedure being called correctly in code? In a past life, wearing my performance engineer hat, I inadvertently brought down a system under test (SUT) with a minimal amount of test load. In the subsequent review, the development manager was quite upset and challenged the test results; I would have as well if the roles were reversed, because the hardware behind the systems matched the proposed production implementation, yet the load was only a small fraction of what was forecasted, much less 1.5x or 2x. What came out of a joint code review was that the system was crushed because one developer had written an API call that retrieved all possible information for each article in the data store, along with a number of custom data points, and presented a subset of fields in the result set. It was designed for administrative-level reporting across all articles and was heavy to run. Developer 2 had a task to handle search input, query the articles that matched the search criteria, and display just the title. Developer 2 saw Developer 1's API and, seeing it had the fields he needed, called it instead! Had Developer 2 had a better understanding of the API, he or she would have chosen a more efficient route (a sketch of the difference follows this list).
- Has one or more of the underlying tables grown unchecked and unnecessarily? This one can be a plague if one or more of the tables being joined is a log table with no truncation or trimming. Say we have a mechanism in a middle tier that holds a record for every translation being done between two systems. Then let's have a few other tables that are used for metric gathering and retention, and solely for that purpose. Those data points should have a defined Time To Live (TTL), and if aggregates of those data points need to be preserved for long-term metric analysis, they should be gathered, processed, and stored as summaries, preferably during non-heavy-use hours. What happens if we don't? Well, what starts out as a few hundred rows of transactional information, with reads and writes happening 30 times a minute, grows into hundreds of thousands of rows, because this is now a highly successful application with many active users and many transactions going back and forth that this system needs to capture, chronicle, and gather metrics on. It should have had a one-week TTL for individual data points, with a job to aggregate a summary and then delete older records; instead it grows uncontrolled, and what used to take microseconds of CPU time now takes several dozen milliseconds each time… (a sketch of such a maintenance job follows below).
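To make the first point concrete, here is a hypothetical sketch of the contrast; the table, column, and parameter names are invented for illustration and are not from the incident described above. The first query is the heavy, administrative-style retrieval; the second is all the search feature actually needed.
-- Hypothetical example; names are illustrative only.
-- Developer 1's administrative report: every column plus extra computed data
-- points for every article. Heavy by design, intended for occasional admin use.
SELECT a.*, m.*,
       (SELECT COUNT(*) FROM article_views v WHERE v.article_id = a.id) AS view_count
FROM articles a
LEFT JOIN article_metadata m ON m.article_id = a.id;

-- What the search feature actually needed: just the matching titles.
SELECT a.id, a.title
FROM articles a
WHERE a.title LIKE '%' + @search_term + '%';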
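And for the second point, here is a minimal sketch of the kind of maintenance job that was missing; the table names and the one-week TTL are assumptions for illustration.
-- Hypothetical scheduled job; names and retention period are illustrative only.
-- 1. Roll detail rows older than the TTL up into a daily summary table.
INSERT INTO transaction_log_daily_summary (log_date, row_count)
SELECT CAST(t.created_at AS date), COUNT(*)
FROM transaction_log t
WHERE t.created_at < DATEADD(day, -7, GETDATE())
  AND NOT EXISTS (SELECT 1
                  FROM transaction_log_daily_summary s
                  WHERE s.log_date = CAST(t.created_at AS date))
GROUP BY CAST(t.created_at AS date);

-- 2. Trim the detail rows that are past their TTL (in practice, delete in
--    batches during non-heavy-use hours to avoid long-running locks).
DELETE FROM transaction_log
WHERE created_at < DATEADD(day, -7, GETDATE());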
The best performance recovery is the one you never have to do, and we'll go over strategies that can be adopted to help reduce or mitigate that risk. If you do find yourself in that situation, a strong awareness of the components and people involved, and a fundamental understanding of the tools you have available to diagnose down to a component, will help greatly. In the case above, it was a combination of database tables becoming too large and needing some active maintenance, and a code change to better tune the application so it did not errantly request far more information than the requirement needed. Often it will be a combination of items that aren't seen until certain usage thresholds are reached, whether in active users or in the size of the databases.
Cheers!