We have a Linux router on a Xeon Gold 5218 that does everything except BGP (queuing, firewall, NAT, DNS). Queuing is done with HTB on the physical interface, and the classes initially looked like this:
/sbin/tc qdisc add dev eth1 root handle 1: htb
/sbin/tc class add dev eth1 parent 1: classid 1:1 htb rate 16000mbit ceil 16000mbit quantum 1536
/sbin/tc class add dev eth1 parent 1:1 classid 1:E789 htb rate 105000kbit ceil 105000kbit
/sbin/tc class add dev eth1 parent 1:1 classid 1:12D6 htb rate 440000kbit ceil 440000kbit
...
U32 hash filters are generated using prefixtree.
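For context, these follow the classic two-level u32 hash scheme. A minimal hand-written sketch of that shape (the table number, divisor, and addresses here are illustrative, not our real ruleset):
# create hash table 2: with 256 buckets
/sbin/tc filter add dev eth1 parent 1: prio 5 handle 2: protocol ip u32 divisor 256
# hash on the last octet of the destination address (offset 16 in the IP header)
/sbin/tc filter add dev eth1 parent 1: prio 5 protocol ip u32 ht 800:: match ip dst 10.0.0.0/24 hashkey mask 0x000000ff at 16 link 2:
# per-client entry lands in bucket a9 (0xa9 = .169) and selects that client's class
/sbin/tc filter add dev eth1 parent 1: prio 5 protocol ip u32 ht 2:a9: match ip dst 10.0.0.169/32 flowid 1:E789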
Everything worked without a problem; however, traffic was approaching the maximum this CPU could handle with the current configuration (about 13G), so I started thinking about replacing the machine. The choice fell on an AMD EPYC 7402P. But before I could switch machines, queuing on the Intel box failed out of nowhere. With traffic at 10G+, every client that landed in a class got 10-20 Mbps and high pings, while the CPU had free resources the whole time. After removing a client from the queuing filters, their speed immediately jumped from 10 Mbps to 1G. Previously, on the same configuration, the machine had pushed 13G with nothing unusual happening (apart from CPU load), so I had no idea what the issue was.

Since I was going to switch machines anyway, I decided to see what would happen, and here's the surprise: on AMD the same queuing problem appeared already at 5-6G of traffic, which made me go back to the old machine.
I started looking for a solution, and it turned out that adding burst and cburst to the classes solved the problem. From then on, my classes looked like this:
/sbin/tc class add dev eth1 parent 1:1 classid 1:E789 htb rate 105000kbit ceil 105000kbit burst 150k cburst 150k
/sbin/tc class add dev eth1 parent 1:1 classid 1:12D6 htb rate 440000kbit ceil 440000kbit burst 150k cburst 150k
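For reference, a common rule of thumb from the old HTB docs is that burst should cover at least one timer tick worth of traffic at the class ceil, plus roughly one MTU. A quick sanity check for the larger class above (HZ=1000 is an assumption; check your kernel's CONFIG_HZ):
CEIL_KBIT=440000                              # ceil of class 1:12D6 in kbit/s
HZ=1000                                       # assumed kernel timer frequency
MTU=1500
BURST=$(( CEIL_KBIT * 1000 / 8 / HZ + MTU ))  # bytes per tick + one packet
echo "minimum sane burst: ${BURST} bytes"     # ~56500 here, so 150k leaves headroom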
After the new configuration had been running on Intel for a while, it was time to switch machines (in the meantime the AMD box got a kernel upgrade from 4.19 to 5.10). Everything seemed to work fine until traffic reached about 10G, when the problem returned and queued clients were again stuck at 10-20 Mbps. Tweaking the burst/cburst values did nothing, and as before the CPU was not overloaded.
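For anyone wanting to compare notes: while the problem is occurring, the per-class and qdisc counters are the first place to look (standard iproute2; watch the measured rate, overlimits, and drops):
# per-class statistics: rate, overlimits, drops, tokens
/sbin/tc -s -d class show dev eth1
# qdisc-level view: drops, requeues, backlog at the root
/sbin/tc -s qdisc show dev eth1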
I noticed one big difference between the machines. On the Intel, while the problem was occurring, limiting queuing to clients with a rate below 150 Mbps made everything work fine again. That seemed logical to me: fewer packets to classify meant the traffic could be shaped correctly. On AMD, however, it behaved completely differently: even when queuing only clients with plans below 50 Mbps, those clients were still stuck at 10-20 Mbps. After these changes only about 1G and roughly 100 kpps of the overall 10-11G of traffic were being queued, and the problem persisted anyway. Later in the evening, when the overall load dropped below 10G, queuing started working and everything was fine even after increasing the number of queues and the amount of classified traffic again.
Does anyone have an idea what is going on here?