I had a short-lived SSL connection problem that affected multiple machines at the same time. The PCAP shows that the TCP handshake (SYN, SYN/ACK, ACK) completes within milliseconds, but there is then a 7-40s delay before the "Client Hello" packet, which is the next step in the SSL setup process, is captured. Under normal conditions this delay would be in the millisecond range. All of these servers are physical servers.
Edit: the Client Hello isn't the only packet that shows a long delay; I just focused on it. There are also many retransmissions with 10-30s delays. The problem is: when exactly is a packet logged in the PCAP? If it is logged on receive, the delay implicates the OS; if it is logged on send, it implicates the network. I haven't been able to determine exactly when the packet is logged.
If the packet were captured as soon as the NIC received it, this would indicate that many different physical servers were suddenly hit, all at the same time, by a delay at the OS level that affects I/O, such as a kernel stall, of which there is no record.
If the packet were captured as soon as it has finished leaving the NIC, this would indicate that the network was unable to take on traffic, but there is no increase in frames, no filled buffers, and no dropped links. Even so, this still seems like the most likely scenario.
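For reference, here is a rough sketch of how the delays described above could be pulled out of a capture using only the Python standard library. It assumes a classic libpcap file (not pcapng) that has already been filtered down to a single connection, so that gaps between consecutive packets are meaningful. The function names `read_pcap_timestamps` and `find_gaps` and the 5-second threshold are my own illustrative choices, not part of any tool.

```python
import struct

# Classic pcap magic numbers as they appear on disk:
# (byte order for struct, size of one fractional timestamp unit in seconds)
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e-6),  # little-endian, microseconds
    b"\xa1\xb2\xc3\xd4": (">", 1e-6),  # big-endian, microseconds
    b"\x4d\x3c\xb2\xa1": ("<", 1e-9),  # little-endian, nanoseconds
    b"\xa1\xb2\x3c\x4d": (">", 1e-9),  # big-endian, nanoseconds
}


def read_pcap_timestamps(data: bytes) -> list[float]:
    """Return the capture timestamp (in seconds) of every record in a classic pcap."""
    if data[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    endian, unit = _MAGICS[data[:4]]

    timestamps = []
    offset = 24  # the global file header is 24 bytes
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_frac, captured length, original length
        ts_sec, ts_frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        timestamps.append(ts_sec + ts_frac * unit)
        offset += 16 + incl_len  # skip the header and the captured bytes
    return timestamps


def find_gaps(timestamps: list[float], threshold: float = 5.0) -> list[tuple[int, float]]:
    """Return (packet index, gap in seconds) for each packet that arrives
    at least `threshold` seconds after the packet before it."""
    gaps = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap >= threshold:
            gaps.append((i, gap))
    return gaps
```

Running `find_gaps(read_pcap_timestamps(open(path, "rb").read()))` on a capture from each affected machine would list where the long pauses fall. Note that these are only the timestamps recorded by the capturing host, so on their own they do not answer which side introduced the delay.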