Performance: Measuring Latency and Throughput of GraphQL Systems
Performance is an important concern for any API, and especially for those that drive user experiences. It's well understood that user engagement with an application drops quickly as latency increases. This blog lays out some of the important considerations for a high-performing GraphQL system and offers tips on what to measure and track.
We're also providing a new StepZen GraphQL Benchmark tool that you can download from GitHub. It'll help you measure the most important aspects of your GraphQL system performance for yourself.
High Performing Systems: Measure What Matters
High-performing systems are the result of many optimizations, each of which typically provides a small performance benefit. This aggregation of marginal gains eventually leads to systems that perform well even when operated under stress. See the HBR article How 1% Performance Improvements Led to Olympic Gold for an interesting analysis of this.
The first step, however, is not any optimization. It's setting up a reliable way to measure the things that matter from a performance perspective. Measuring performance in a meaningful way is what this blog is about.
Throughput and Latency Matter
Traditionally, for serving systems, two metrics have served as the key measures of performance: throughput and latency. Throughput is the number of calls that the system can field per second, typically measured in requests per second. Latency is how long it takes a client making a call to receive the response. A typical performance study will measure the throughput-latency curve.
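To make these definitions concrete, here is a minimal sketch (in Python, using hypothetical data; not part of the StepZen tool) that derives throughput and latency percentiles from per-request start and end timestamps collected during a benchmark run:

```python
import statistics

def summarize(timings):
    """Summarize a benchmark run.

    timings: list of (start_seconds, end_seconds) pairs, one per request.
    Returns throughput in requests/second and latency percentiles in ms.
    """
    latencies = [end - start for start, end in timings]
    # Wall-clock span of the whole run, used to compute throughput.
    wall_clock = max(end for _, end in timings) - min(start for start, _ in timings)
    return {
        "throughput_rps": len(timings) / wall_clock,
        "latency_p50_ms": statistics.median(latencies) * 1000,
        # The last of 19 cut points approximates the 95th percentile.
        "latency_p95_ms": statistics.quantiles(latencies, n=20)[-1] * 1000,
    }
```

Tracking a tail percentile such as p95 alongside the median matters because averages hide exactly the slow outliers that users notice.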
It's broadly understood that as systems become loaded, latency increases, something we have all likely experienced.
It is also known that systems that are unhappy (for example, those that are completely out of memory, or are otherwise buggy and leak resources) are both slow and process few calls: throughput and latency suffer together. There is no trade-off between the two parameters; both are simply bad. Therefore, any throughput vs. latency curves that were established when the system was behaving well are meaningless when the system is in an unhappy state. Put another way, system state is a hidden variable that influences both latency and throughput.
Whether a system goes into an unhappy state, and how often it does, is often more relevant to performance than any throughput-latency trade-off.
Measuring Constantly Matters
A common mistake is to focus on the throughput vs. latency curve as the sole measure of performance. In reality, identifying the conditions that cause a system to undergo a behavior change or phase change and become unhappy is just as important as, if not more important than, the curve itself.
To do this, one needs to measure the performance of live systems constantly. It's not a one-shot exercise. At StepZen we focus not just on the performance characteristics of our system, but also on the vitals of the production system: memory usage and CPU usage, in addition to latency and throughput.
We are looking for two things:
- Changes in each of the monitored vitals. Any significant change will cause alerts to be issued.
- Slow degradations over the long term.
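As a sketch of the first kind of check, the following compares each new sample of a vital (say, p95 latency) against a rolling baseline and flags sharp deviations. The window size and sigma threshold here are illustrative assumptions, not a description of how StepZen's alerting actually works:

```python
import statistics
from collections import deque

class VitalMonitor:
    """Flag significant changes in one monitored vital."""

    def __init__(self, window=60, threshold_sigma=3.0):
        self.samples = deque(maxlen=window)  # rolling baseline
        self.threshold_sigma = threshold_sigma

    def observe(self, value):
        """Record a sample; return True if it deviates sharply from baseline."""
        alert = False
        # Wait for a minimal baseline before judging new samples.
        if len(self.samples) >= 10:
            mean = statistics.fmean(self.samples)
            stdev = statistics.stdev(self.samples)
            if stdev > 0 and abs(value - mean) > self.threshold_sigma * stdev:
                alert = True
        self.samples.append(value)
        return alert
```

Slow long-term degradations need a different check, such as comparing weekly baselines, because each individual sample stays within the rolling window's normal range.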
System performance is a moving target. It needs to be tracked, not just measured.
On Benchmarks and Measurements
Many vendors publish performance benchmarks. We do as well, sometimes. While these are useful indicators, you should never rely on them alone. If performance is important to you, trust, but inspect.
Performance benchmarks are usually built in idealized environments tuned to the strengths of a system. Real-world workloads run in chaotic environments, and exercise both the best and the most problematic parts of a system.
StepZen GraphQL Benchmark Tool
So if the performance of a system is important to you, then measure it yourself. The StepZen GraphQL Benchmark tool will help you get started.
As you measure your GraphQL system, keep the following tips in mind:
- Measure it using a workload that is as close to realistic as possible.
- Measure it on a production system, and not on something set aside for testing.
- Measure regularly, and compare the performance you get with what you got previously.
- Make sure you know the reasons for any significant differences you observe.
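Putting the tips together, a minimal sketch of a repeatable benchmark run might look like the following. The endpoint URL and query are placeholder assumptions (GraphQL servers conventionally accept a JSON body with a `query` field), and the StepZen GraphQL Benchmark tool should be preferred for real measurement:

```python
import json
import time
import urllib.request

# Placeholder query; substitute one that matches your schema and real traffic.
GRAPHQL_QUERY = "{ customer(id: 1) { name email } }"

def run_benchmark(url, query, n=100, send=None):
    """Issue n queries and return per-request latencies in seconds.

    `send` can be overridden (e.g. in tests); the default POSTs the
    query as JSON, the conventional GraphQL-over-HTTP request shape.
    """
    if send is None:
        def send(url, query):
            body = json.dumps({"query": query}).encode()
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"}
            )
            urllib.request.urlopen(req).read()
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send(url, query)
        latencies.append(time.perf_counter() - start)
    return latencies
```

Persisting each run's latencies with a timestamp makes the regular run-to-run comparison in the tips above straightforward.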