I ran into an interesting issue today that I could not find anywhere on Google, so I wanted to document it.
In my move over from C++ app development to web development, I suddenly found I was hugely distanced from the operations side of things. Part of this was unfamiliarity with Linux systems, and part was the more delineated roles between developers, DevOps and operations. More things are automated and handled by DevOps than I previously had to worry about, which leaves me more time to code, but it does mean I am less comfortable with operations than I used to be. However, I still find that one of my favorite things to do is investigate the root cause of particularly tricky or hard-to-reproduce bugs, and this has led me to some interesting edge cases in our technical setup.
It is common in web development to have multiple layers of routers and load balancers stacked on top of each other, each performing a slightly different role: for example, a security / payload manipulation layer on top, then a routing/caching layer underneath that routes the request to the app itself. To take advantage of the hardware the software is running on, it is then common for the app itself to have many workers (a combination of processes, threads and greenlets) to service the requests.
The app I was working on today used nginx for routing, and then Gunicorn for actually serving the code. A while ago I’d solved an issue caused by an nginx setting called proxy_next_upstream, which by default passes a request that has timed out on to the next node in the pool. For non-idempotent requests this can obviously cause huge issues, e.g. an object being copied multiple times instead of just once. The solution was to change nginx to only retry a request on a different node when the current node is down (i.e. the connection to it fails, which nginx would otherwise report as a 502), and to return an error once the timeout has occurred in every other case. This is done by changing this setting to proxy_next_upstream error; (compared to its default, proxy_next_upstream error timeout;).
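To make the difference between the two settings concrete, here is a minimal sketch in Python. It is a toy model of the decision, not nginx's actual implementation: the function name and the two constants are made up for illustration, and only the "error" and "timeout" conditions mentioned above are modelled.

```python
# Toy model of the retry decision described above. This is NOT nginx code;
# the names below are hypothetical and exist only for illustration.

# Mirrors the default setting: proxy_next_upstream error timeout;
DEFAULT_CONDITIONS = frozenset({"error", "timeout"})

# Mirrors the changed setting: proxy_next_upstream error;
ERROR_ONLY = frozenset({"error"})


def passes_to_next_node(failure: str, conditions: frozenset) -> bool:
    """Return True if a request that failed with `failure` is retried on another node."""
    return failure in conditions


if __name__ == "__main__":
    for label, conditions in (("default", DEFAULT_CONDITIONS), ("error only", ERROR_ONLY)):
        for failure in ("error", "timeout"):
            retried = passes_to_next_node(failure, conditions)
            print(f"{label:<10} {failure:<8} retried={retried}")
```

With the error-only setting, a request that merely times out is no longer handed to another node, which is the behaviour we want for non-idempotent requests.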
Recently I noticed that the exact same issue I’d found previously seemed to be back, although all the configs were correct. I found two odd things: 1) the request was being retried every 30 seconds, even though the nginx timeout was set to the default, 60 seconds, and 2) there were NO further logs from the request that timed out once the 30 seconds had elapsed, whereas previously the app had continued processing the request as normal.
Eventually I found that Gunicorn has a default worker timeout of 30 seconds, and that when it kills a worker, two things happen. First, the worker is completely stopped from doing any more work (which is probably good if there’s a deadlock or anything like that happening). Second, nginx ends up with a 502 (Bad Gateway) for the request, not a 504 (Gateway Timeout). This means that nginx thinks the request has not been processed at all, and passes it on to the next address in the pool. The problem hadn’t surfaced earlier because the app was previously using Waitress; it wasn’t until the move to Gunicorn that it appeared.
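The interaction between the two timeouts can be sketched as a small timeline model. This is a simplification written for this post, not code from either project: the function name is hypothetical, and it assumes that a killed worker shows up to nginx as an "error" (the 502 case) while nginx's own timer expiring shows up as a "timeout" (the 504 case).

```python
# Simplified model of which failure nginx observes for a slow request.
# Hypothetical helper for illustration; defaults match the values in the post.

def what_nginx_sees(request_seconds, gunicorn_timeout=30, nginx_read_timeout=60):
    """Return (seconds_elapsed, outcome) from nginx's point of view.

    outcome is "response", "error" (worker killed, 502) or "timeout" (504).
    """
    first_deadline = min(gunicorn_timeout, nginx_read_timeout)
    if request_seconds < first_deadline:
        # The app answers before either timer fires.
        return request_seconds, "response"
    if gunicorn_timeout < nginx_read_timeout:
        # Gunicorn kills the worker first, so nginx never gets a response
        # and treats the request as never having been handled.
        return gunicorn_timeout, "error"
    # nginx gives up first (equal timeouts are treated the same way here,
    # although in practice that case would be a race).
    return nginx_read_timeout, "timeout"


if __name__ == "__main__":
    print(what_nginx_sees(45))                       # slow request, default timeouts
    print(what_nginx_sees(45, gunicorn_timeout=70))  # same request, longer worker timeout
    print(what_nginx_sees(90, gunicorn_timeout=70))  # request slower than both timers
```

In the first case nginx sees an error after 30 seconds, which is exactly the condition under which proxy_next_upstream error; still retries, and that matches the 30-second retry interval in the logs.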
Given the popularity of both nginx and Gunicorn, I’m surprised this issue isn’t better documented. I guess it is rare for a request to take longer than 30 seconds, and when one does, the retry has simply gone unnoticed.
The solution, of course, is just to adjust the timeouts so that Gunicorn’s is at least as long as nginx’s. I ended up giving ours a little more leeway, so that any downstream connections (with the same timeout value) that might be slowing us down will time out, let us do any cleanup, log a few things, and then get out, rather than having the worker killed at exactly 60 seconds.
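As a rough sketch, a Gunicorn config file for this might look like the following. Gunicorn config files are plain Python and timeout is a real Gunicorn setting, but the numbers are illustrative assumptions rather than the values we actually deployed, and the two upper-case names are hypothetical helpers that Gunicorn itself does not recognise.

```python
# gunicorn.conf.py: a minimal sketch, with assumed values.

# nginx's read timeout for proxied requests (60 seconds by default).
# Hypothetical helper constant, not a Gunicorn setting.
NGINX_PROXY_READ_TIMEOUT = 60

# Extra seconds so the app can notice a downstream timeout, clean up and log
# before the worker would be killed. Hypothetical value.
LEEWAY_SECONDS = 10

# Real Gunicorn setting: seconds a worker may go without responding
# before it is killed and restarted (default 30).
timeout = NGINX_PROXY_READ_TIMEOUT + LEEWAY_SECONDS
```

It would be loaded with something like gunicorn -c gunicorn.conf.py myapp:app, where myapp:app is a placeholder for the real application. Deriving the worker timeout from the nginx value keeps the relationship between the two explicit if either one changes.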
As always, it was incredibly satisfying to figure out and solve this issue, although I was quite chagrined to see how relatively simple the underlying cause was, despite it manifesting in some confusing and difficult-to-pin-down side effects.