When looking for a reverse proxy that serves as the entry point for our server infrastructure, we stumbled upon the excellent Traefik. Of course, we wanted load balancing and, of course, we wanted to assign weights; so, we skimmed the documentation and quickly found what we were searching for.
Traefik supports load balancing in the form of Weighted Round Robin (WRR), which directs web traffic in cycles through the available services. This sounded just like what we wanted. However, as we discovered, it was not entirely so. Our journey begins here.
Update: One day after we proposed our workaround in their GitHub repository, the Traefik team started working on the feature. It will finally land in Traefik v3.0.
TL;DR: You may want to jump directly to the solution.
Services versus Servers
There was something we had overlooked: a small hint in the documentation that WRR is only available for services, not for servers.
This strategy is only available to load balance between services and not between servers.
Initially, while learning how to configure Traefik correctly, we had no idea what services and servers meant in the context of Traefik. Now that we know, what we wanted was to serve a single service on multiple servers as follows:
services:
  my_service:
    loadBalancer:
      servers:
        - url: "https://server1:1234/"
        - url: "https://server2:1234/"
        - url: ...
Since every server has a different amount of compute power, we also wanted to weight each server. However, it is not possible to assign weights to the servers in this setup, and there has been an open GitHub issue about this since 2019.
Only Traefik v1 had a weight property, but we did not want to rely on legacy software.
Ideas
We had different ideas on how to overcome this missing feature.
Idea 1: Misusing Health Checks to Simulate Bad Servers
The idea was to mimic an unhealthy server when it reached its maximum capacity. This way, Traefik would stop directing traffic to the server once it had no free slots left.
# Example health check configuration
http:
  services:
    my_service:
      loadBalancer:
        healthCheck:
          path: /health
          interval: "5s"
          timeout: "3s"
# Example FastAPI endpoint
from fastapi import FastAPI, Response

app = FastAPI()

@app.get('/health')
def health():
    # Report "unhealthy" once the job queue exceeds this server's compute capability
    if job_q.qsize() > COMPUTE_CAPABILITY:
        return Response('Too Many Requests', status_code=429)
    return Response('OK', status_code=200)
Whenever the job queue overtook the server's compute capability, the server would be reported to Traefik as unhealthy within at most eight seconds (the 5-second check interval plus the 3-second timeout). Traefik would then take it out of the Round Robin and let other servers serve the requests. We tested this solution, and it turned out to cause problems.
Firstly, the server dropout is slow. Even when overloaded, a server still receives requests until the health check reports it as unhealthy. Eight seconds can be a long time, and reducing the interval was not an option if we wanted to keep the server load manageable.
Secondly, and most surprisingly, we encountered NS_ERROR_NET_PARTIAL_TRANSFER errors for some requests. We did not investigate further, but we believe this occurred because servers were caught between serving long-running requests and being reported as unhealthy, causing the client connection to break. Maintaining such an infrastructure is undesirable.
Thirdly, our service is stateful. This means that we use sticky sessions, ensuring each client communicates only with one dedicated server assigned at the beginning of the session. Having one of these servers become unavailable during a session poses a problem we thought we could solve. We could not. We dropped Idea 1 instead.
Idea 2: Manipulating the Session
When establishing a session, we use JavaScript fetch() with credentials included:
await fetch('https://edge_server/', {credentials: 'include'})
This way, Traefik assigns a new session cookie to the client with the Set-Cookie header on the initial fetch request.
# Example Traefik configuration with sticky sessions enabled
services:
  my_service:
    loadBalancer:
      sticky:
        cookie:
          name: session_name
          httpOnly: true
If we omitted the credentials in the request, Traefik would assign a new session on each request using the Round Robin procedure. Idea 2 was to let the client gather new sessions until it found a server with free slots left.
We only had to define a new FastAPI dispatch endpoint that tells the client whether the server behind the current session has free slots, as well as a fetch retry method in JavaScript to find the right server. For the endpoint, we could reuse the health() method from above. For the fetch retry method, we of course also wanted to handle the case where all servers were busy. We came up with this:
async function fetch_retry(...args) {
    let response;
    for (let i = 0; i < 300; i++) {
        for (let s = 0; s < N_SERVERS; s++) {
            response = await fetch(...args);
            // The server behind this session has no free slots left
            // (the same 429 the health()/dispatch endpoint returns)
            if (response.status === 429) {
                // All servers were busy in this round: wait before the next round
                if (s === (N_SERVERS - 1)) {
                    await new Promise(r => setTimeout(r, 20000));
                }
                continue;
            }
            return response;
        }
    }
    // Give up after 300 rounds and return the last (busy) response
    return response;
}
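On the server side, the dispatch endpoint could look roughly like the following sketch. It reuses the same capacity check as the health() method above and returns the 429 status that fetch_retry() checks for; job_q and COMPUTE_CAPABILITY are illustrative stand-ins for the actual job queue and server capacity:

# Sketch of a dispatch endpoint (illustrative, not our exact implementation)
import queue

from fastapi import FastAPI, Response

app = FastAPI()

COMPUTE_CAPABILITY = 4              # illustrative capacity of this server
job_q: queue.Queue = queue.Queue()  # illustrative job queue

@app.get('/dispatch')
def dispatch():
    # No free slot left: the client should retry and obtain a new session
    if job_q.qsize() > COMPUTE_CAPABILITY:
        return Response('Too Many Requests', status_code=429)
    # Free slot available: the client can stick with this session
    return Response('OK', status_code=200)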
Now, before any other call to fetch(), we called our dispatch endpoint without credentials:
await fetch_retry('https://edge_server/dispatch');
And indeed, we received a new Set-Cookie header for each loop iteration.
Unfortunately, the existing session cookie does not get updated by the fetch_retry() call, as this only happens when credentials are included, at least in Firefox. We then tried including the credentials and clearing the session storage whenever we had a session for a “bad” server. Unfortunately, sessionStorage.clear() does not clear the credential cookie, even though it is also a session cookie. This is true even when setting httpOnly to false (which is not recommended for security reasons anyway).
So, we had to sleep another night before coming up with another idea.
The Final Idea: Faking Servers
The final idea was based on the question: How does Traefik handle server URLs?
If we could point multiple URLs that Traefik sees as different to the same server, we could emulate server weights by faking multiple server URLs that all point to the same server. And indeed, this worked as expected. We just needed to add a non-existent path, such as “/1/” or “/2/”, to the URLs, and Traefik would handle them as separate servers, yet route requests to the right server without errors and without appending the fake path.
services:
  my_service:
    loadBalancer:
      servers:
        - url: "https://server_A:1234/1/"
        - url: "https://server_A:1234/2/"
        - url: "https://server_B:1234/"
We only had to ensure that we created the correct number of redundant fake server URLs matching each server's capability. For example, if server A was twice as capable as server B, we needed to add two URLs for server A and one for server B. This way, during Round Robin, server A would receive twice as many requests as server B.
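With more than a handful of servers, maintaining these redundant URLs by hand becomes tedious. As a rough sketch (not part of our actual setup), the server list for Traefik's file provider could be generated from a map of relative capacities; the URLs, capacities, and service name below are placeholders:

# Generate a weighted Traefik server list by repeating fake sub-path URLs
import yaml  # PyYAML

# Relative capacities: server_A is twice as capable as server_B
capacities = {
    "https://server_A:1234": 2,
    "https://server_B:1234": 1,
}

servers = []
for base_url, weight in capacities.items():
    if weight == 1:
        servers.append({"url": base_url + "/"})
    else:
        # One fake sub-path per unit of capacity: /1/, /2/, ...
        servers.extend({"url": f"{base_url}/{i}/"} for i in range(1, weight + 1))

config = {"http": {"services": {"my_service": {"loadBalancer": {"servers": servers}}}}}
print(yaml.safe_dump(config, sort_keys=False))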