The Response to the Wrong Request

I encountered this in my work at Segovia. It appeared quite rarely, and everyone’s best efforts at debugging turned up nothing. For all I know it’s still happening. When I left the company, I wrote in the internal bug tracker that if anyone ever solves this mystery, they are to contact me and tell me what the hell was going on, or else I’ll be haunted forever.

I write this story with the faint hope that someone will read it, recognize this problem, and finally give me some closure.

Symptom

The bug was in a microservice that we’ll call “Wrangler” for our purposes here. It exposed an HTTP interface internally, which was called from the main server application.

Very rarely — as in once every few weeks — the main app would throw an error that seemed to indicate that Wrangler had returned a response that looked to be for another request entirely. Wrangler had several endpoints and returned a different shape of JSON from each; in these rare instances, it would appear that a request to endpoint A had returned a response in the shape of a response from endpoint B.
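
To make the symptom concrete, here is a minimal sketch of the kind of client-side call that would hit this error. The host, endpoint paths, and JSON field names are invented for illustration (the real Wrangler API was internal); the relevant point is just that each endpoint returned a distinctive response shape, and the error fired when the shape didn’t match the endpoint that had been called.

    const http = require('http');

    // Hypothetical client-side helper; host, port, and paths are made up.
    function callWrangler(path) {
      return new Promise((resolve, reject) => {
        const req = http.request(
          { host: 'wrangler.internal', port: 8080, path, method: 'GET' },
          (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => {
              try { resolve(JSON.parse(body)); } catch (err) { reject(err); }
            });
          }
        );
        req.on('error', reject);
        req.end();
      });
    }

    async function getPaymentStatus(id) {
      // Hypothetical "endpoint A": expected to return { paymentId, status }.
      const body = await callWrangler(`/payments/${id}/status`);
      if (body.paymentId === undefined || body.status === undefined) {
        // When the bug hit, `body` instead had the shape of a different
        // endpoint's response (say { batchId, items }), which this endpoint
        // never returns.
        throw new Error(`Unexpected response shape from Wrangler: ${JSON.stringify(body)}`);
      }
      return body;
    }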

Background

Wrangler was a Node.js service (running Node 12 at the time of these events), built on Node’s built-in HTTP server. It sat behind the main server application, which ran as several instances on separate EC2 hosts and called Wrangler directly over HTTP.

Debugging

The main server app (the client of Wrangler) already had full logging of all the HTTP requests it made and their responses. What that logging showed was that when the bug happened, there were two requests going to Wrangler near-simultaneously: one of them received a response shaped for the other’s endpoint, and the other received no response at all and eventually timed out.

These requests were coming from different instances of the main app, on different EC2 hosts, so there was no possibility that it was some kind of mixup there. The problem had to be on the Wrangler side.

This bug was kind of a worst-case scenario for debugging because there was no apparent way to repro it on demand, and it occurred very rarely in the wild. It was essentially impossible to “catch it in the act”, and you might have to wait for weeks until your next chance to confront it.

In these situations, your only real option is to blanket the offending code with logging, wait for the bug to happen again, and sift through the piles of log messages.

What our Wrangler-side logging indicated seemed to point to the two requests’ sockets somehow getting swapped: Request 1’s response got written to Request 2’s socket, and then Request 2’s response couldn’t be written anywhere. Meanwhile, nothing was written to Request 1’s socket, and its client timed out.

I put logging everywhere I could reach, but nowhere in the resulting messages did I see any kind of mismatch. The request body, response body, and socket remote addresses always matched.
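
The Wrangler-side logging was along these lines. This is a sketch rather than the actual code, and the request-ID scheme and the handle() function are placeholders, but it shows the idea: tie each request body, response body, and socket address together under one ID, so that any cross-wiring would be visible in the logs.

    const http = require('http');
    const crypto = require('crypto');

    // Hypothetical stand-in for Wrangler's real endpoint logic.
    function handle(url, requestBody) {
      return { ok: true, url };
    }

    const server = http.createServer((req, res) => {
      const requestId = crypto.randomBytes(8).toString('hex');
      const remote = `${req.socket.remoteAddress}:${req.socket.remotePort}`;

      let requestBody = '';
      req.on('data', (chunk) => { requestBody += chunk; });
      req.on('end', () => {
        console.log(`[${requestId}] ${req.method} ${req.url} from ${remote} body=${requestBody}`);

        const responseBody = JSON.stringify(handle(req.url, requestBody));

        // Log the response together with the socket it is about to be written
        // to. In every occurrence we captured, these lines matched the request:
        // no cross-wiring was ever visible at this level.
        console.log(`[${requestId}] responding to ${remote} with ${responseBody}`);
        res.setHeader('Content-Type', 'application/json');
        res.end(responseBody);
      });
    });

    server.listen(8080);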

I tried repro’ing the bug by making lots of simultaneous requests from a script, but I could never trigger it that way. It just continued to happen once every few weeks in production.
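
The repro script looked roughly like the sketch below, reusing the made-up endpoints and response shapes from the earlier example: fire pairs of near-simultaneous requests at two different endpoints and flag any response whose JSON shape belongs to the other one.

    const http = require('http');

    function get(path) {
      return new Promise((resolve, reject) => {
        http.get({ host: 'wrangler.internal', port: 8080, path }, (res) => {
          let body = '';
          res.on('data', (chunk) => { body += chunk; });
          res.on('end', () => {
            try { resolve(JSON.parse(body)); } catch (err) { reject(err); }
          });
        }).on('error', reject);
      });
    }

    async function main() {
      for (let round = 0; round < 10000; round++) {
        // Fire one request to each (made-up) endpoint at nearly the same time.
        const [a, b] = await Promise.all([
          get('/payments/123/status'), // expected shape: { paymentId, status }
          get('/batches/456'),         // expected shape: { batchId, items }
        ]);
        if (a.batchId !== undefined) {
          console.error("endpoint A got endpoint B's response shape:", a);
        }
        if (b.paymentId !== undefined) {
          console.error("endpoint B got endpoint A's response shape:", b);
        }
      }
    }

    main().catch(console.error);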

I started reading the Node.js http.server code, but didn’t spend too long on it (this bug just wasn’t worth hours and hours of digging), so I can’t draw any definitive conclusions from that.

We upgraded Wrangler from Node 12 to Node 14, partially as a “maybe this will fix it” kind of measure, but it didn’t help; the bug kept happening afterward.

(Unsatisfying) Conclusion

Everything about this bug looks like a race condition in the HTTP server infrastructure, but I feel like it can’t be, purely as a practical matter. Node’s HTTP server library is in extremely heavy use in the wider world, and if there really were a race condition that caused a bug like this, it’s unimaginable that we would have been the first to hit it.

Still, I can’t come up with any theory that puts the bug in our code and isn’t ruled out by the evidence. I don’t know what to think.