The best thing about HAProxy is its single process event driven architecture. It gives amazing performance in most real world scenarios and happily saturates gigabit+ networks...

But the single process design is also the worst thing about HAProxy, because until now the way of scaling to 10G+ has been to use multi-process mode — which while incredibly fast... doesn't allow you to share advanced functionality such as stick tables between the processes (which sucks!)

Luckily Willy Tarreau and the HAProxy development team are geniuses, and after several years of hard work not only have they given HAProxy a multi-threaded mode... But they've even managed to keep the event driven architecture — with a scheduler in each thread. My brain can't quite wrap around how difficult that would have been, but the result is damn near perfect.

So what are my options for scaling HAProxy?

It’s no big secret that a single processor can only handle so much processing in a given time. So what happens when you reach the limits of what a single processor can handle? Simple, you add more processors.

How does this equate in HAProxy? Well, we have four modes we can call HAProxy in.

Single process mode - This is the ‘classic’ mode, a single process is created which handles all the transactions for HAProxy. Simple but limited in what it can handle (as you’ll see later).

Multi-process mode - Multiprocessing spreads the work across more CPUs/processors by starting several HAProxy processes, which increases the computing speed of the system.

Multi-thread mode - Multithreading lets a single process create several threads, which increases the responsiveness of the system.

[Figure: Difference between Multiprocessing and Multithreading]

Combination mode - A mixture of multi-process and multi-thread mode where you get X processes with Y threads per process.

Baseline analysis

Before we can perform any form of comparison, we need to ascertain what we’re doing a comparison against. I decided to use ‘iPerf’ as it’s (fairly) simple to use and does what is needed.

I set up a test lab using 3 old Dell R210 servers, all equipped with new dual port Intel 10Gb NICs.

The iPerf client and server were both installed with Debian 9 (stretch) and had their packages updated prior to testing. The HAProxy system was configured with 3 NICs, one on each of the client, server and management networks. The client and server networks used the dual port 10Gb Intel card, whereas the management network used the built-in Broadcom 1Gb NIC.

[Iperf Client] <---> [HAProxy] <---> [Iperf Server]

As iPerf, by default, uses TCP port 5001, I set up a Layer 7 service listener directed to a single iPerf server and disabled the service health checks.

listen IPerf10
         bind transparent
         mode tcp
         option abortonclose
         server Real1  weight 100

On the iPerf client, I ran;

$ iperf -c -t 300 -P 1

  • -c means “run as client mode, with the following as your server”
  • -t means “number of seconds to run” (time)
  • -P is “number of concurrent connections” (number of parallel jobs)

I chose a 5 minute (300 second) run so that it would give a chance for the traffic to settle without taking too long. I also chose a single connection as a base starting point.

The result of this was;

Client connecting to, TCP port 5001
TCP window size: 85.0 KByte (default)
[  3] local port 60282 connected with port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-300.0 sec   116 GBytes  3.32 Gbits/sec

3.32 Gbits/sec
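
As a sanity check, the reported bandwidth can be recomputed from the transfer size and the interval. This is a minimal sketch, assuming iPerf 2's unit convention (GBytes are 1024-based, Gbits are 1000-based); the function name is my own and not part of iPerf.

```python
GIB = 1024 ** 3  # assumption: iPerf 2 reports "GBytes" as 2**30 bytes


def gbits_per_sec(transfer_gbytes: float, seconds: float) -> float:
    """Convert an iPerf 'Transfer' figure into Gbit/s (1 Gbit = 10**9 bits)."""
    bits = transfer_gbytes * GIB * 8
    return bits / seconds / 1e9


# 116 GBytes over the 300 second run
print(f"{gbits_per_sec(116, 300):.2f} Gbit/s")  # prints "3.32 Gbit/s"
```

The result matches the 3.32 Gbits/sec line in the iPerf output, which suggests the unit assumption holds for this run.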

I then decided I would see what happens if I increased the number of concurrent connections to 4.

Client connecting to, TCP port 5001
TCP window size: 85.0 KByte (default)
[  5] local port 60290 connected with port 5001
[  3] local port 60286 connected with port 5001
[  4] local port 60284 connected with port 5001
[  6] local port 60288 connected with port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-300.0 sec  30.4 GBytes   871 Mbits/sec
[  3]  0.0-300.0 sec  30.4 GBytes   871 Mbits/sec
[  4]  0.0-300.0 sec  30.4 GBytes   871 Mbits/sec
[  6]  0.0-300.0 sec  30.4 GBytes   871 Mbits/sec
[SUM]  0.0-300.0 sec   122 GBytes  3.48 Gbits/sec

3.48 Gbits/sec - Not a massive improvement in the grand scheme of things.

I also tried with 8 concurrent connections and that also made very little difference.
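
The runs so far differ only in the -P value, so they could be scripted rather than typed by hand. The sketch below only builds and prints the command lines; "IPERF_SERVER" is a hypothetical placeholder for the server address and the helper name is my own.

```python
import shlex
from typing import List


def iperf_client_cmd(server: str, parallel: int, seconds: int = 300) -> List[str]:
    """Build the iPerf 2 client command line for one test run."""
    return ["iperf", "-c", server, "-t", str(seconds), "-P", str(parallel)]


# "IPERF_SERVER" is a hypothetical placeholder, not the address used in the lab.
for connections in (1, 4, 8):
    print(shlex.join(iperf_client_cmd("IPERF_SERVER", connections)))
```

Each list could be handed to something like subprocess.run() to actually start the test, one run at a time.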

Time to try some changes in HAProxy...

Multi-process mode

Caveat 1: multi-process mode does not support synchronizing stick tables over multiple processes, so each process will have its own, making them useless for persistence. However, HTTP cookies are supported, so some persistence options do exist.

Caveat 2: multi-process mode means the stats page only displays one process per port, and requires additional config to show the different process stats.

Required configuration changes;
‘global’ section;
Add “nbproc X” where X is an integer reflecting the number of processes you want to start.

for example;

    stats socket /var/run/haproxy.stat mode 600 level admin
    pidfile /var/run/
    maxconn 40000
    nbproc 4

Multi-thread mode

Required configuration changes;
‘global’ section;
Add “nbthread Y” where Y is an integer reflecting the number of threads you want to start.

for example;

    stats socket /var/run/haproxy.stat mode 600 level admin
    pidfile /var/run/
    maxconn 40000
    nbproc 1
    nbthread 4

No other changes are needed to other sections.
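
For combination mode (X processes with Y threads per process) the two directives are simply set together. The helper below is a small sketch that renders the 'global' section lines for any mix; the function name is hypothetical, and it assumes an HAProxy build, like the 1.9-dev one tested here, that accepts nbproc and nbthread side by side.

```python
from typing import List


def scaling_directives(processes: int = 1, threads: int = 1) -> List[str]:
    """Return the 'global' section lines for X processes with Y threads each."""
    if processes < 1 or threads < 1:
        raise ValueError("processes and threads must be positive integers")
    return [f"nbproc {processes}", f"nbthread {threads}"]


# Combination mode example: 4 processes with 2 threads per process.
for line in scaling_directives(processes=4, threads=2):
    print("    " + line)
print("total threads across all processes:", 4 * 2)
```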

iPerf Results

Processes  Threads  Concurrent Connections  Throughput (Gbit/sec)
1          1        1                       3.32
1          1        4                       3.48
4          1        4                       8.03
4          1        8                       7.93
4          1        16                      8.64
8          1        4                       6.14
8          1        8                       9.35
8          1        16                      9.39
16         1        4                       6.13
16         1        8                       9.38
16         1        16                      9.31
1          4        4                       6.07
1          4        8                       8.07
1          4        16                      8.07
1          8        4                       3.55
1          8        8                       9.22
1          8        16                      9.20
1          16       4                       3.15
1          16       8                       9.15
1          16       16                      9.08
2          2        4                       6.40
2          2        8                       9.10
2          2        16                      8.02
4          2        4                       8.15
4          2        8                       9.32
4          2        16                      9.17
8          2        4                       6.17
8          2        8                       8.96
8          2        16                      9.14
2          4        4                       8.00
2          4        8                       9.00
2          4        16                      9.11
4          4        4                       6.11
4          4        8                       9.31
4          4        16                      9.15
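
To make the table easier to digest, here is a short sketch that picks the best run for each process/thread combination and expresses it as a multiple of the single process, single connection baseline. The figures are copied from the table above; the helper name is my own.

```python
# (processes, threads, concurrent connections, Gbit/sec), copied from the table
RESULTS = [
    (1, 1, 1, 3.32), (1, 1, 4, 3.48),
    (4, 1, 4, 8.03), (4, 1, 8, 7.93), (4, 1, 16, 8.64),
    (8, 1, 4, 6.14), (8, 1, 8, 9.35), (8, 1, 16, 9.39),
    (16, 1, 4, 6.13), (16, 1, 8, 9.38), (16, 1, 16, 9.31),
    (1, 4, 4, 6.07), (1, 4, 8, 8.07), (1, 4, 16, 8.07),
    (1, 8, 4, 3.55), (1, 8, 8, 9.22), (1, 8, 16, 9.20),
    (1, 16, 4, 3.15), (1, 16, 8, 9.15), (1, 16, 16, 9.08),
    (2, 2, 4, 6.40), (2, 2, 8, 9.10), (2, 2, 16, 8.02),
    (4, 2, 4, 8.15), (4, 2, 8, 9.32), (4, 2, 16, 9.17),
    (8, 2, 4, 6.17), (8, 2, 8, 8.96), (8, 2, 16, 9.14),
    (2, 4, 4, 8.00), (2, 4, 8, 9.00), (2, 4, 16, 9.11),
    (4, 4, 4, 6.11), (4, 4, 8, 9.31), (4, 4, 16, 9.15),
]

BASELINE = 3.32  # 1 process, 1 thread, 1 connection


def best_per_config(results):
    """Map (processes, threads) -> (connections, Gbit/sec) of the fastest run."""
    best = {}
    for procs, threads, conns, gbps in results:
        key = (procs, threads)
        if key not in best or gbps > best[key][1]:
            best[key] = (conns, gbps)
    return best


for (procs, threads), (conns, gbps) in best_per_config(RESULTS).items():
    print(f"{procs:>2} proc x {threads:>2} thr: {gbps:.2f} Gbit/sec "
          f"at {conns} connections ({gbps / BASELINE:.2f}x baseline)")
```

Bear in mind these are single runs on one test rig, so small differences between the top configurations may well be noise.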


A single, non-threaded HAProxy process can handle a throughput of 3.32 Gbit/sec, which is fine unless you have 10 Gbit NICs. At that point the more threads and processes you can throw at it, within reason, the better, although you need to be mindful that the rest of the system also needs processor time to continue functioning. When using multiple processes you lose the ability to use stick table based session persistence (HTTP cookies still work), and the stats require some ugly additional config. Multiple threads, however, have most of the advantage of multiple processes (it goes faster) but without the drawbacks outlined above.

I feel it's also important to highlight that this was purely a 'maximum throughput/saturation test'. Further testing will be conducted, which will include 'ramp testing' (start small and work up) as well as longer test runs and differing payload size tests.

Test system

    HAProxy v1.9-dev0-929b52d-247 (git build)
    iPerf v2.0.9 (from Debian 'stable' repository)
    iPerf client and iPerf server systems built using Debian GNU/Linux stable.

    Manufacturer: Dell Inc.
    Product Name: PowerEdge R210
    CPU: Intel(R) Xeon(R) CPU X3430  @ 2.40GHz
    RAM: 2GB DDR3 1333MHz
    NIC: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (Dual port)

I intentionally used legacy hardware (approx. 9 years old) to highlight the problem of a low clock speed when using single process mode.

If anyone is interested in the hardware specification of our hardware load balancer: the Enterprise R20 CPU is an E3-1230v5 @ 3.40GHz (and has been for the last 3 years).

Coming soon...

Shortly I will do some proper testing with a Spirent Avalanche for all sorts of different packet sizes, to get a more realistic benchmark of real world scenarios.

That's if we can stop Mark the development manager playing with his new and very expensive toy...