HAProxy is awesome. So awesome, in fact, that here at Loadbalancer.org HQ I find it very difficult to generate enough load to break it... so let's try harder!

[Image: Spirent Avalanche 3100]

I know from my previous 10G testing using the very expensive IXIA Optixia XM2 chassis equipped with two Xcellon-Ultra NP blades that our appliances have no problem saturating 10G networks with small or large packets, low or high concurrency.

But personally, I've had a nagging worry that our new Enterprise Ultra (a 20-core monster) might actually have lower real-world performance than our entry-level Enterprise R20 (4-core Xeon).

Why was I worried?

Because the Ultra may have 20 cores, but it has a lower CPU clock speed of 2.3GHz, rather than 3.4GHz on the Enterprise R20.

HAProxy can happily be configured to use multiple cores when you are using SSL or cookie-based persistence - however, if you need to use a stick-table, this is currently only effective on one core (nbproc 1).
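To illustrate the stick-table case, here is a minimal, hypothetical backend configuration (server names and addresses are made up). Each HAProxy process keeps its own copy of the table, which is why persistence of this kind only works reliably with a single process:

```
# Hypothetical backend showing why stick-tables imply nbproc 1:
# the table is per-process, so multiple processes would each see
# a different (incomplete) view of client persistence.
backend web_servers
    balance roundrobin
    stick-table type ip size 200k expire 30m
    stick on src
    server web1 192.168.1.10:80 check
    server web2 192.168.1.11:80 check
```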

NB: Willy Tarreau is working on multi-threading and HTTP2 support right now...

We were also just about to release a major kernel upgrade and wanted to have a consistent benchmark test before release to ensure we didn't get any nasty surprises.

I decided to spend a whole week performance testing. I carried out the tests on a selection of our appliance platforms, using the almost-as-expensive Spirent Avalanche test hardware.

The Spirent was configured as both the load generator (client side) and as the responding servers (server side). I used a pretty simple configuration split across physical 1Gb interfaces on the load balancer. This made sure we had enough bandwidth to carry out the tests on HAProxy.

Why not 10G interfaces?

Because I'm trying to test a crazy scenario to deliberately max out HAProxy on one core. No production site would ever be pushing 1 Gbps of 1 byte packets, and my boss said the 10G Spirent was too expensive :-(

The VMware & Hyper-V appliances were configured in the same way, with their internal interfaces mapped to physical interfaces via vSwitches. The hypervisor host ran on the bare metal of a spare 20-core server.

All numbers are quoted in TPS, or Transactions Per Second. A transaction is 1 complete HTTP GET request (/index.html) and its corresponding response.

The Spirent Avalanche was configured to be brutal...

  • The servers were replying with a 1 byte response or a '*' to be more accurate.
  • No HTTP session re-use.
  • No HTTP Keep-alive. (no cheating!)
  • 1 request per client with a 1ms delay per SimUser per request
  • Approximately 200 concurrent SimUsers on a /24 subnet

Incidentally, specifying no delay enables 'Burst Mode' which breaks everything. I assume it starts attempting to flood the network as fast as it can.

IPTables was stopped for the duration of the tests (apart from when we were using NAT). This stopped the nf_conntrack module from filling up the connection table and killing performance. We disabled SYN cookie protection too, as this can also affect performance at a high number of TPS.
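For reference, this kind of preparation would look something like the following on a RHEL/CentOS-style appliance - these exact commands are an assumption on my part, so adjust them to your own init system:

```
# Stop IPTables so nf_conntrack doesn't track every connection
service iptables stop

# Disable SYN cookie protection for the duration of the test
sysctl -w net.ipv4.tcp_syncookies=0
```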

The appliances tested were the Enterprise R20 and the new Enterprise Ultra.

HAProxy & HTTP Tests

For the first test, HAProxy was set to start 1 process but was not bound to any particular core - and neither were the IRQ queues, i.e. 'HAProxy nbproc 1, No IRQ Tuning'.

Other than disabling SYN cookies and IPTables, nothing was done to the box - it was exactly how a customer would receive it.

Result: Yes, the Ultra was slower than the R20 as expected, but still very fast at 44,000 TPS.

I then played with binding processes and IRQs...

I used the HAProxy configuration nbproc 1 and cpu-map 1 0. This tells HAProxy to start one process and map process 1 to core 0.
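In the global section of the HAProxy configuration, that looks like this (HAProxy numbers processes from 1 but cores from 0):

```
global
    nbproc 1
    cpu-map 1 0    # pin process 1 to core 0
```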

I disabled the IRQ Balance service (which does give great results for Layer 4 LVS traffic) and then ran the set_irq_affinity.sh script from Intel, which allows you to bind the IRQ queues to a specific core, a group of cores, or just the even or odd cores.

This worked fantastically well, assigning HAProxy to one core and the IRQ queues to another.

set_irq_affinity.sh remote eth0
set_irq_affinity.sh remote eth1
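Under the hood, scripts like set_irq_affinity.sh just write a hex bitmask of allowed cores into /proc/irq/<N>/smp_affinity for each of the interface's IRQs. As a sketch (the IRQ number below is hypothetical), the mask can be computed like this:

```shell
# Build a hex CPU-affinity mask from a list of core numbers.
# This is a sketch of what set_irq_affinity.sh does internally.
cpu_mask() {
  local mask=0 core
  for core in "$@"; do
    mask=$(( mask | (1 << core) ))
  done
  printf '%x\n' "$mask"
}

cpu_mask 1 3 5    # cores 1, 3 and 5 -> mask "2a"

# As root, you would then apply the mask to each queue's IRQ, e.g.:
#   echo 2a > /proc/irq/123/smp_affinity   # 123 is a hypothetical IRQ number
```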

One of the big things I really noticed when doing these tests was the impact of context switching (moving process jobs around cores) on the maximum TPS. So I decided to play with moving the queues around and then using one CPU.

On the first test, I tried binding all of the remaining 38 IRQ queues (two per processor core) to a single core, which created a performance bottleneck. Then I tried various other combinations. It wasn't until the queues were split across 3 cores that I saw the numbers I expected - i.e. using one CPU, with 1 core for HAProxy and 3 cores for the IRQs.

set_irq_affinity.sh 1,3,5 eth0
set_irq_affinity.sh 1,3,5 eth1

It has been suggested (via online sources) that you can gain a further improvement by putting the IRQ queues and the HAProxy process on nearby cores, so they get the benefit of a shared cache. We tried putting HAProxy on core 1 and moving the IRQ queues from core 0 to core 2, hoping to hit a shared cache, but this yielded no discernible improvement. It appears that on newer Intel processors the cores don't share L2 cache, only L3, which might explain why it made no difference.

Result: Separating the IRQs from HAProxy is effective, but it requires at least 3 cores for the IRQ queues to reach full performance - too many queues on one core causes a blockage.

Finally, just for fun, we checked that we could get 150,000 TPS through HAProxy by using nbproc 8, binding each process to a core and moving the IRQ queues to the other processor. This also yielded a maximum rate of 1.3 million packets per second. Which is a lot :-).
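A sketch of that 8-process configuration - assuming the 8 HAProxy processes go on cores 0-7 of the first CPU, with the IRQs moved off to the second:

```
global
    nbproc 8
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    cpu-map 5 4
    cpu-map 6 5
    cpu-map 7 6
    cpu-map 8 7
```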

SSL - STunnel & HTTPS Tests

For the SSL tests we had STunnel pointing directly to HAProxy over localhost. I tried using a unix socket instead of localhost but this made no apparent difference.

NB. Having HAProxy on localhost gives you a 10% performance gain

I used a 2048-bit self-signed certificate with the cipher ECDHE-RSA-AES256-GCM-SHA384. This is close to the default cipher in our appliance.
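The STunnel side of that setup would look roughly like the following fragment - the service name, port and certificate path here are assumptions, not our exact test config:

```
; Hypothetical stunnel service: terminate HTTPS and forward
; plain HTTP to HAProxy listening on localhost
[https]
accept  = 443
connect = 127.0.0.1:80
cert    = /etc/ssl/server.pem
ciphers = ECDHE-RSA-AES256-GCM-SHA384
```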

But check out the amazing results Intel got here with the ECDHE-ECDSA cipher!

I then switched the clients to request over HTTPS instead. Everything else remained the same.

With reference to the Ultra, we had STunnel bound to one processor (not just one core, obviously) and the IRQ queues to the other. We noticed that STunnel does not scale well across processors: as we pushed it beyond one of the 10-core processors, context switching became an issue again and really affected throughput.

However, we would expect most customers to be running more than just one service on an Ultra - so the extra 10 Core CPU gives you plenty of horse power for multiple applications at the same time.

Result: SSL is really, really fast - can anyone push this kind of traffic on a production site? Also, as you are only using one of the CPUs for this you have ample power left over for HAProxy.

VMware Tests

For VMware, I was only interested to find out if running a hypervisor in between the appliance and the hardware would impact the TPS. Also note, nothing else was running on the server while running the tests. It was an ideal situation.

Result: VMware is very fast with approximately 15% overhead compared to bare metal.

Hyper-V Tests

I was interested to see whether the new v4.4.49-lb2 kernel was faster than the v2.6.35 kernel. The LIS (Linux Integration Services) drivers have been improved since 2.6.35 was released, and it was believed they would show a marked improvement in performance.

Result: v4.4.49-lb2 was 40% faster on Hyper-V, as expected - and definitely fast enough for most people, but nothing like as good as VMware.

Table showing all tests (HAProxy restricted to one core only):

Test                            | Enterprise R20    | Enterprise Ultra  | Enterprise VA (VMware)              | Enterprise VA (Hyper-V)
                                | 3.4GHz quad core  | 2x 10 core 2.3GHz | 1 vCPU     | 2 vCPU     | 3 vCPU    | 2.6.35 kernel | 4.4.49-lb2
HAProxy nbproc 1 (No Tuning)    | 55,000 TPS        | 44,000 TPS        | 33,000 TPS | 41,000 TPS | 40,000 TPS| 9,000 TPS     | 16,000 TPS
HAProxy nbproc 1 (IRQs Tuned)   | 77,000 TPS        | 53,000 TPS        | 33,000 TPS | 48,000 TPS | 50,000 TPS| -             | -
SSL                             | 2,000 TPS         | 6,000 TPS         | -          | -          | -         | -             | -
SSL process pinning, IRQs Tuned | 3,000 TPS         | 8,000 TPS         | -          | -          | -         | -             | -
Layer 4 NAT                     | 90,000 TPS        | 90,000 TPS        | -          | -          | -         | -             | -


Why was I a little scared of publishing these results?

Because some of the numbers don't look that big, and we reveal some obvious limitations of our own product.

But I think it's far more important to be honest with benchmarks - and this data could well be useful to other members of the HAProxy community.

Conclusion

You definitely can blow up HAProxy when using nbproc 1, but it does fail gracefully. And seriously, can you imagine anyone pushing that kind of traffic?

Other HAProxy benchmarks: