Last week, I was working with a customer at a large Internet site (over 20 million users) who was having performance problems with some of their internal infrastructure. The issue: slow connections to an HTTPS service. After buying some new super-duper big-iron servers, this customer (using SteelApp Traffic Manager) started to move services off the “old and busted” and onto the “new hotness” and immediately started seeing slow connections. And by slow, I mean 12 seconds slow. On “old and busted”, these HTTPS transactions were humming through at around 230-250 ms from Go to Whoa. Adding insult to injury, the slow connections weren’t occurring for every transaction; it was more like 1 in every 4 connections. The customer swore that the new version of the software was the root cause, but I doubted it. I arranged with the customer to get a tcpdump from the Linux host running SteelApp Traffic Manager, and had them recreate the issue so that I had some hard data to work with.
From the packet trace I could see that the sessions were all coming from a single IP address, so it was unlikely that a routing issue further upstream was to blame. I started collecting statistics to see how many sessions were affected, and the first thing I looked for was how many sessions were in the trace in total, using the Wireshark filter (tcp.flags == 2) && (ip.dst == a.b.c.d). In other words, show me all the SYN (or connection start) packets that are destined for the IP address a.b.c.d; a flags value of exactly 2 means only the SYN bit is set, so SYN-ACKs are excluded. Instantly I could see the proof of the slow connections. The image at the left is a grab of the actual trace: you can see the initial SYN (the first time the client attempted to connect), after which the client waits patiently for the next part of the “Three Way Handshake”, which is a SYN-ACK. The client doesn’t see a SYN-ACK, so it sends another SYN 1.05 seconds later, and another 1.10 seconds later, and so on until the server finally sends the SYN-ACK 11.72 seconds later.
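If you’d rather not eyeball the trace, the same check can be scripted. Here is a minimal Python sketch, assuming the handshake packets have already been exported from the capture as (timestamp, client IP, client port, kind) tuples. The function names, the tuple layout and the sample values are my own illustration, not data from the customer’s trace.

```python
from collections import defaultdict


def handshake_delays(packets):
    """Summarise TCP handshakes per client (ip, port).

    packets: iterable of (timestamp_seconds, client_ip, client_port, kind),
    where kind is "SYN" (client to server) or "SYN-ACK" (server to client).
    Returns {(ip, port): (syn_count, seconds_until_syn_ack_or_None)}.
    Assumption: a client port is not reused within the capture.
    """
    first_syn = {}
    syn_count = defaultdict(int)
    delay = {}
    for ts, ip, port, kind in sorted(packets):
        key = (ip, port)
        if kind == "SYN":
            first_syn.setdefault(key, ts)  # remember only the first attempt
            syn_count[key] += 1
        elif kind == "SYN-ACK" and key in first_syn and key not in delay:
            delay[key] = ts - first_syn[key]
    return {key: (syn_count[key], delay.get(key)) for key in first_syn}


def retransmitted(summary):
    """Connections whose SYN had to be sent more than once."""
    return sorted(key for key, (count, _) in summary.items() if count > 1)


# Illustrative data only: one slow handshake and one normal one.
sample = [
    (0.00, "192.0.2.10", 40001, "SYN"),
    (1.05, "192.0.2.10", 40001, "SYN"),
    (2.15, "192.0.2.10", 40001, "SYN"),
    (11.72, "192.0.2.10", 40001, "SYN-ACK"),
    (0.50, "192.0.2.10", 40002, "SYN"),
    (0.51, "192.0.2.10", 40002, "SYN-ACK"),
]

summary = handshake_delays(sample)
for key in retransmitted(summary):
    count, delay = summary[key]
    print(key, "SYNs sent:", count, "seconds to SYN-ACK:", delay)
```

Dividing the number of retransmitted connections by the total gives the share of slow sessions, which is the “1 in 4” style figure you want to confirm.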
Here was the proof I needed. Incoming connections are managed by the operating system’s network stack. It is only once the “Three Way Handshake” is completed that the connection is passed up the stack to the application. The version of the software is irrelevant in this case. The root cause is much lower than the application, waaaayy down in the kernel.
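One way to convince yourself that the operating system, not the application, completes the handshake is a quick loopback experiment. The Python sketch below (the helper name is hypothetical) opens a listening socket, never calls accept(), and checks whether a client connect() still succeeds.

```python
import socket


def connect_without_accept(backlog: int = 5) -> bool:
    """Return True if connect() completes although the server never calls accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(backlog)
        port = server.getsockname()[1]

        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(2.0)
        try:
            # connect() returns once the three-way handshake has finished.
            client.connect(("127.0.0.1", port))
            return True
        except OSError:
            return False
        finally:
            client.close()
    finally:
        server.close()


if __name__ == "__main__":
    print("handshake completed without accept():", connect_without_accept())
```

On a typical system this succeeds: the connection sits in the listening socket’s queue, fully established, until the application gets around to accepting it. When that queue is full, new connection attempts can go unanswered, which is consistent with the repeated SYNs in the trace.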
In this scenario, the root cause was the network stack running out of space in the queue that holds incoming connections waiting to be accepted (the listen backlog). The underlying Linux operating system had not been tuned for a large number of incoming connections. The fix was to tune the network parameters, specifically raising net.core.somaxconn from its default of 128 to 1024. After making this change, the performance issue was resolved.
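To check the setting on a Linux host, the current value can be read from /proc. A small Python sketch follows. The helper names are hypothetical, and effective_backlog only models the documented Linux behaviour that a backlog passed to listen() larger than net.core.somaxconn is silently capped to it.

```python
from pathlib import Path

# Standard Linux location for the net.core.somaxconn sysctl.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path=SOMAXCONN_PATH) -> int:
    """Read the current cap on listen() backlogs from a sysctl file."""
    return int(Path(path).read_text().strip())


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Backlog the kernel actually uses: the request, capped at somaxconn."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    if SOMAXCONN_PATH.exists():
        limit = read_somaxconn()
        print("net.core.somaxconn =", limit)
        print("listen(1024) would get a backlog of", effective_backlog(1024, limit))
    else:
        print("no", SOMAXCONN_PATH, "here (not a Linux host?)")
```

The change itself is typically made with sysctl -w net.core.somaxconn=1024 and persisted in /etc/sysctl.conf. Because the value is only a cap, the application also has to ask for a large enough backlog when it calls listen().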