The single most important thing whilst building an Adtech platform for OpenRTB bidding is sub milli-second response times. My job primarily is to achieve that. To serve billions of ad requests, a huge infrastructure is required. And saving up on infrastructure costs becomes a huge task.
When building robust Microservices and deploying them on Kubernetes or any other platform, one would have surely encountered latency issues between different services. As the traffic increases, the pods (or servers) need to be scaled up accordingly for a particular service.
But it is not always necessary that you need to linearly scale one service based on the traffic it receives from the other. Say there is a service whose job is to respond with the eCPM value, or a targeting service which does runs complex lua scripts in redis to figure out the targeted partners for a particular device. This data wouldn’t necessary change per-second. A user is not going to change states in a few mins neither is the eCPM value going to change every second. Many a times, data required from the other service doesn’t change continuously, and if this data can be cached briefly, a lot of traffic can be served faster without making a network call.
Let’s take two services for this example, Service A and Service B. Service B figures out some data from redis and provides it to Service A on request. So ideally when the traffic increases, both the services need to be scaled linearly to handle the load.
Let’s benchmark Service A and Service B and check their response times. Both services are being monitored using prometheus.
A simple ab test gives us the following result.
Increasing the concurrency to 40
As you can see, from the charts, as the RPS increases on Service A, the time taken for the client to obtain response from Service B increases too. But Service B prometheus stats do not show much of a difference, and the response time is pretty consistent. What is the reason for this behavior?
THE TCP GAMEPLAY
A lot happens under the hood for a TCP request, when two services interact.
Lot of things come into play when two services interact over TCP, the number of sockets involved(Socket layer), the network IO speed(Protocol/Interface layer), number of cores of the machine (apart from application code). Bottlenecks could happen at any of these layers. The sockets may run out because of a limit on the system, not enough threads available on a server to serve (hence shooting up the load average leading to delay in response). Hence Service B metrics only the record response times on the application layer, whereas Service A metrics (for Service B) takes into account the whole tcp cycle, which includes the connection time, tcp wait time etc.
How do we solve this problem in general? A few obvious approaches would be to either add more servers or upscale the hardware. But is it always required? Is the data being served from the other service needed in real time that every single time a request has to be made? These are few questions every application developer must consider. It is not necessary that the configurations or values that we fetch from the other service may update frequently (or every second). That is where things could be optimized.
KEEP EVERYTHING IN MEMORY
High frequency trading systems had always intrigued me as they are limited by the same problems. But they seem to overcome it and provide responses in real time. Here are a few points to note when building a HFT platform:
- Optimize code for performance
- Keep everything in memory
- Keep hardware unutilized
- Keep reads sequential, utilize cache locality
- Do async work as much as possible
Keeping everything in memory is a good viable option for us since in most of our cases, the RAM remained underutilized compared to the cores. Remember, Service A requests data every single time from Service B, but Service B’s data doesn’t change frequently. When updates/writes are not very frequent, one could cache the response from one service locally for a few minutes to avoid requesting the other service for the same data again and again. Similar to how L2/L3 cache works on a processor to avoid requesting from main memory/disk all the time, the same idea can be applied at a service level to increase the traffic for better response times.
ZIZOU, an alternate answer
Keeping this in mind, I made a simple library called Zizou which serves as an in memory (sharded) cache, which can store keys locally in memory, with eviction policy enabled. This just acts as another layer of cache locally on the system, similar to what one would store on redis to avoid lookup’s from different services for the same stale data. And since the data is served from memory, it helps your service respond back in pretty quickly and frees up your resources for other important tasks.
Using it is pretty simple. You have simple Get/Set functions which keep a map of the key value pairs in memory with the expiry specified.
A background goroutine runs periodically which evicts deleted keys from memory to free up the heap.
While this may seem a trivial caching mechanism which many would have implemented, this simple hack has helped me save a lot of money in infrastructure because this avoids unnecessary scaling of services with increase in traffic, thereby reducing traffic on other Microservices in our cluster, and therefore not scaling the services unnecessary which would otherwise scale based on the custom resource RPS metrics in Kubernetes.
Here’s the code.
Live, laugh, and always Cache.