The issue
So a while ago, while working on a project, I encountered an endpoint that required some heavy computations to produce the response, and as a result the response usually took a bit more than 5 minutes. The examples that existed showcasing the usage of said endpoint were all done with curl.
So, my turn comes to consume this endpoint and lo and behold! The request hangs. Okay, weird. Let's try curl again. Works fine. Python, nope. Let's check with wget then, no luck.
In my desperation I tried all the Python libraries I had used in the past (requests, the built-in http.client and aiohttp), obviously setting all the applicable timeouts sky high. Still no luck!
So what's so special about curl? What did it do that both wget and my Python implementation failed to?
Troubleshooting
Desperate times call for desperate measures, I told myself, so I grabbed my trusty strace, brewed a big ol' cup of coffee and got to work.
The first thing that is immediately obvious to me is that curl does indeed do something different: strace tells me that curl blocks with epoll while both wget and my Python solutions block with select. This gives me a first clue that curl does 'something else' (TM) besides just waiting for a response, but it leaves me with not much more to follow up on.
I decide to switch context and go from the lowest possible (for me at least) level to my most high-level approach: replicate the Python solutions with insanely high timeouts and monitor their behavior. This yields an interesting result! The target endpoint supports a notation for the client to specify the seconds until the request should time out, but despite my explicitly setting it to 600 seconds (that's 10 minutes) the remote server hasn't sent me (or should I say I haven't gotten ;) ) an explicit timeout for more than 15 minutes. This brings back bad memories ... It uncannily resembles the behavior of a firewall that, instead of rejecting the packets, just silently discards them; this way the client never explicitly knows that it cannot connect and just waits.
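To make the "just waits" part concrete, here is a minimal sketch (not the setup from this story) that uses a local socket pair as a stand-in for client and server. The helper name `wait_for_data` and the short timeout are made up for illustration. While the peer stays silent, `select` reports nothing readable until its timeout expires, and from the client's side that looks the same whether the server is still computing or something in between quietly discarded the connection.

```python
import select
import socket

# A connected pair of local sockets stands in for client and server.
client, server = socket.socketpair()


def wait_for_data(sock, timeout_seconds):
    """Return True if sock becomes readable within timeout_seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_seconds)
    return bool(readable)


# The peer says nothing: select simply runs into its timeout.
silent = wait_for_data(client, 0.2)

# Once the peer sends something, the same call reports readable data.
server.sendall(b"done")
answered = wait_for_data(client, 0.2)

print("readable while peer is silent:", silent)
print("readable after peer answers:", answered)

client.close()
server.close()
```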
But the reminder that curl worked perfectly quickly snaps this thought out of my mind. Time for another coffee (I could use the break anyway)! While I wait for my coffee to brew I start whining to (ehm .. I mean discussing with) a colleague (SysAdmin) about the issue and my findings. In a heartbeat he suggests a misbehaving firewall. But why? How could curl go through? I still don't get it. It's nearly night and there's a weekend ahead of me, so I call it a day.
Monday morning I find a set of netstat commands (one with curl running against said endpoint and one with wget) along with their results. Son of a female dog!
With curl:

    $ netstat -at --timers
    Proto Recv-Q Send-Q Local Address       Foreign Address  State        Timer
    .....
    tcp   0      0      localMachine:lPort  remoteMachine    ESTABLISHED  keepalive (60/0/0)
    .....

With wget:

    $ netstat -at --timers
    Proto Recv-Q Send-Q Local Address       Foreign Address  State        Timer
    .....
    tcp   0      0      localMachine:lPort  remoteMachine    ESTABLISHED  off (0.00/0/0)
    .....

So now I know! curl uses TCP-level keepalive, which means that small probe packets are transmitted at a fixed interval while the connection is otherwise idle, regardless of whether there is any actual data to transfer. So there is indeed a misconfigured firewall somewhere along the way that chops down long-running connections (anything longer than 5 minutes, as I found out with some troubleshooting) without informing either party that the connection was dropped.
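The `keepalive` versus `off` timer in the netstat output corresponds to the `SO_KEEPALIVE` socket option. As a rough sketch, the snippet below reads that option on a fresh TCP socket before and after enabling it. No connection is made, and the helper name `keepalive_enabled` is made up for illustration.

```python
import socket


def keepalive_enabled(sock):
    """Return True if SO_KEEPALIVE is set on the given socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# State of the option on a freshly created socket
before = keepalive_enabled(sock)

# Turn keepalive on and read the option back
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
after = keepalive_enabled(sock)

print("keepalive before:", before)
print("keepalive after:", after)

sock.close()
```

A client that never sets this option stays silent while it waits, which is presumably what the firewall in this story punished.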
And now what?
So you know what kind of problem you have, but this isn't even half the solution. Unfortunately, in none of the HTTP libraries I use in Python could I find a reliable way to enable TCP keepalive through their API.
Show me the code!
I took the simplest and most 'core' solution here because it's easier for showcasing the approach; the same logic would apply with any other library.

    import http.client
    import socket

    host = "example.com"  # placeholder; use the host of your own endpoint

    # Construct your headers; maybe add a keepalive header here to avoid
    # the remote server closing the connection
    headers = {}

    # Create your connection object
    conn = http.client.HTTPSConnection(host, timeout=600)
    conn.connect()

    # Now you will need to access the socket object of your connection;
    # how you access this will vary depending on the library you use
    s = conn.sock

    # Set the following socket options (feel free to play with the values
    # to find what works best for you). The TCP_KEEP* constants may not
    # exist on every platform.
    # SO_KEEPALIVE: 1 => Enable TCP keepalive
    # TCP_KEEPIDLE: 60 => Idle time in seconds until the first keepalive is sent
    # TCP_KEEPINTVL: 60 => Seconds between subsequent keepalive packets
    # TCP_KEEPCNT: 100 => Max number of unanswered keepalive packets before giving up
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 100)

    # A GET carries no body, so pass only the headers
    conn.request("GET", "/endpoint", headers=headers)
    response = conn.getresponse()
    data = response.read()
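If you need this in more than one place, the same socket options can be wrapped up once. What follows is only a sketch under a few assumptions: the names `enable_tcp_keepalive` and `KeepAliveHTTPSConnection` are made up for illustration, the defaults simply mirror the values used above, and the `TCP_KEEP*` constants are guarded because not every platform exposes them.

```python
import http.client
import socket


def enable_tcp_keepalive(sock, idle=60, interval=60, count=100):
    """Turn on TCP keepalive and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning constants are missing on some platforms, so guard each one.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


class KeepAliveHTTPSConnection(http.client.HTTPSConnection):
    """HTTPSConnection that enables keepalive right after connecting."""

    def connect(self):
        super().connect()
        enable_tcp_keepalive(self.sock)
```

Usage would then look like `conn = KeepAliveHTTPSConnection(host, timeout=600)` followed by the usual `conn.request(...)` and `conn.getresponse()`; http.client calls `connect()` on the first request, so the options are applied before any data is sent.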