Performance. We all know the importance of it. We all have an innate sense that any system can and should do better. Yet so many questions arise. Where do you start? How do you measure it? Do you aim for it deliberately, or do you bet on improved performance as a side effect? Let me tell you the story of how we went from a 99th percentile response time of 4000 ms to sub-10 ms.
In brief, two big changes got us there: moving from .NET Framework to .NET Core, and getting rid of a third-party dependency. Nothing fancy, but the end result was too good not to write about.
But first things first, let’s define our system. It is an API whose main purpose is to serve media content for NRK, the Norwegian Broadcasting Corporation – the Norwegian equivalent of the BBC iPlayer, if you will. Media in this context is playable content such as TV and radio channels, news clips, and TV programs. The API serves two main resources per media type – manifest and metadata. The manifest is an HLS (HTTP Live Streaming) manifest. The metadata is the surrounding media metadata such as titles, usage rights, images, and video versions (e.g., master, audio described, or sign language interpreted).
The API is written in C# on ASP.NET Core Web API. It is hosted on Azure in two data centers with an NGINX and Varnish gateway in front. Data is stored in Azure Blob Storage and Cosmos DB with some additional in-memory cache.
Moving from .NET Framework to .NET Core was driven by technical lag. Technical lag is when your platform, runtime, or framework lags behind major versions that could be highly beneficial for you. It was about time to move on to the latest and greatest, and many had already reported lower response times after transitioning. The migration took us a couple of weeks, and the end result was a great surprise. The 99th percentile response time for a manifest request of any media type on .NET Framework was well above 4000 ms. In other words, 99% of the requests completed within roughly 4000 ms, while 1% of users experienced response times of 4000 ms or more. With .NET Core, the 99th percentile dropped well below 1000 ms, and response times became far more stable. It is hard to pinpoint exactly what in .NET Core benefited us performance-wise, but there is no doubt the transition was worth it.
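To make the percentile idea concrete, here is a small self-contained C# sketch (the latency numbers are made up, not our production data) showing how a 99th percentile is computed with the nearest-rank method, and how it exposes a slow tail that the average smooths over:

```csharp
using System;
using System.Linq;

class PercentileDemo
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(double[] samples, double p)
    {
        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Max(rank, 1) - 1];
    }

    static void Main()
    {
        // 98 fast responses plus 2 slow outliers (hypothetical numbers).
        var latencies = Enumerable.Repeat(50.0, 98)
                                  .Concat(Enumerable.Repeat(4000.0, 2))
                                  .ToArray();

        Console.WriteLine($"average = {latencies.Average():F0} ms"); // 129 ms
        Console.WriteLine($"p99     = {Percentile(latencies, 99):F0} ms"); // 4000 ms
    }
}
```

The average of 129 ms looks perfectly healthy, while the 99th percentile reveals that some users are waiting four seconds.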
With .NET Core it was also easier than ever to completely change the runtime environment. After years of Windows-based App Services, we decided to move to Docker with Alpine Linux, the smallest .NET Core runtime Docker image. A 70 MB footprint with all our assemblies included is quite impressive. We do not have numbers to compare against, but our experience is that the overall system is more reliable, especially under high load.
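For illustration, a minimal multi-stage Dockerfile for this kind of setup could look like the sketch below. The image tags and the `MyApi` project name are placeholders, not our actual build configuration:

```dockerfile
# Build stage: restore and publish with the full .NET Core SDK image.
FROM mcr.microsoft.com/dotnet/core/sdk:3.1 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app MyApi.csproj

# Runtime stage: the small Alpine-based ASP.NET Core runtime image.
FROM mcr.microsoft.com/dotnet/core/aspnet:3.1-alpine
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "MyApi.dll"]
```

The multi-stage build keeps the SDK out of the final image, which is what makes the small footprint possible.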
Any video or audio playback – any manifest request – was dependent on a third-party asynchronous (HTTP) load balancing service for CDN distribution. This service is necessary in order to distribute playback content across multiple CDN providers. The 99th percentile response time for this dependency was somewhere around 30 ms; however, we have seen cases where it suddenly increased tenfold. This is where a percentile is far more revealing than an average.
Increased latency in the dependency means increased latency for a manifest request, as illustrated below. The delta between the two is barely noticeable, which suggests this dependency was the leading factor in the overall manifest request latency.
Latency was actually never the main driver for rewriting this dependency – a business decision was. It was decided to no longer use the third party, so we had to write our own CDN load balancing service.
Simply put, the new service is now a DLL dependency. Any video or audio playback – any manifest request – synchronously retrieves a CDN URL for the manifest. The only asynchronous parts of the service are a couple of timer-based background tasks implemented as .NET Core Hosted Services. These tasks fetch the data needed to compute which CDN to distribute traffic to. The data is stored and shared in memory, so there is little to no performance impact on the overall playback request.
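A hypothetical sketch of such a timer-based hosted service is shown below. The names `CdnWeightStore` and `CdnWeightRefresher`, the placeholder weights, and the 30-second interval are all assumptions for illustration, not our actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Immutable snapshot of CDN weights, swapped atomically so the playback
// hot path can read it lock-free and fully synchronously.
public sealed class CdnWeightStore
{
    private volatile IReadOnlyDictionary<string, double> _weights =
        new Dictionary<string, double>();

    public IReadOnlyDictionary<string, double> Current => _weights;

    public void Publish(IReadOnlyDictionary<string, double> weights) =>
        _weights = weights;
}

// Timer-based background task (a .NET Core Hosted Service) that
// periodically refreshes the data used to pick a CDN.
public sealed class CdnWeightRefresher : BackgroundService
{
    private readonly CdnWeightStore _store;

    public CdnWeightRefresher(CdnWeightStore store) => _store = store;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // In a real service this would fetch CDN health/traffic data;
            // here we just publish placeholder weights.
            _store.Publish(new Dictionary<string, double>
            {
                ["cdn-a"] = 0.7,
                ["cdn-b"] = 0.3,
            });

            await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
        }
    }
}

// Registration in Startup.ConfigureServices:
//   services.AddSingleton<CdnWeightStore>();
//   services.AddHostedService<CdnWeightRefresher>();
```

With this shape, the manifest request path simply reads `store.Current` in memory – no outbound HTTP call on the hot path.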
Our immediate reaction after replacing the third-party dependency was that something had to be wrong. The 99th percentile response time for a channel manifest request was previously somewhere around 37 ms. After the swap, it fell to a jaw-dropping sub-3 ms. We also noticed the nightly “heartbeat” disappear: every night between 11:00 PM and 6:00 AM, the 99th percentile response time would increase by a factor of 3–4, most likely due to fewer cache hits at the third-party dependency.
We observed a similar effect with a clip manifest request. The 99th percentile fell from 30 ms to sub 10 ms, and the nightly heartbeat was also gone, more or less. Quite satisfying.
The same effect was not present for a program manifest request. This is due to a different and greater bottleneck caused by storage access. We are currently working on this, so we hope to see the 99th percentile equally low in the coming months.
I wish I could say we deliberately aimed for better performance, when in fact we got there more as a side effect of implementing new business requirements and of not lagging behind technologically. Performance is complicated. You can twist and turn loops and allocations, but sometimes luck comes your way and there are plenty of improvements to pick up. I have, nonetheless, learned that averages can be misleading. They say listen to your heart. I say listen to your 99th percentile. Identify your dependencies and their 99th percentiles. That can be a good start.