Firewall throughput measurements: OPNsense on APU4d4, OPNsense in a Proxmox VM, and OpenWRT on Turris Omnia
Why
For a few weeks I have been struggling to get good performance out of OPNsense on my low-power test box, an APU4d4. While OPNsense is very well done from a firewall rules management point of view (although I am not happy that forwarding rules cannot specify both incoming and outgoing interfaces, as is possible with Linux Netfilter…) and has many features of expensive firewall products (including web interface based management for clustering/failover), the FreeBSD/HardenedBSD kernel seems to struggle with higher throughputs. After not making progress with simple trial and error using various settings gathered from different howto guides, I decided to first measure my current status properly.
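For illustration only (the interface names and the filter table/chain are hypothetical, not taken from my setup), this is what matching both the incoming and the outgoing interface in a single forward rule looks like with Netfilter:

    # nftables: only allow new forwarded connections from the DMZ VLAN towards the WAN interface
    nft add rule inet filter forward iifname "lan0.7" oifname "wan0" ct state new accept

    # the equivalent legacy iptables rule
    iptables -A FORWARD -i lan0.7 -o wan0 -m conntrack --ctstate NEW -j ACCEPT

In OPNsense/pf, forwarding rules are bound to the interface a packet arrives on, so the outgoing interface cannot be matched in the same rule.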
For roughly the last 10 years, I have been running my home lab setup on OpenWRT based routers (for a long time on a Mikrotik RB2011, which is extremely power efficient for what it can do), more recently on a Turris Omnia for the automatic updates coupled with maximum flexibility (and the snapshot features are really well integrated). However, for teaching our course on “Network Security” at the Institute of Networks and Security at JKU Linz, we decided to use OPNsense because it comes with an easy-to-understand web interface and is open source. A direct comparison therefore seems useful.
All systems under test have a roughly equal IP (v4 and v6) and firewall rules configuration. For completeness, I compare the OPNsense installation on the APU4d4 to a similarly configured OPNsense instance inside a VM on the same Proxmox host.
How
My setup is pretty simple: a Proxmox server hosts a small number of VMs connected to a DMZ VLAN, attached through a Linux host bridge that connects the VMs' virtio network interfaces with a tagged VLAN on the hardware NIC, which acts as a trunk to the local Ethernet switch. On the same switch, I have a desktop connected through a 1 Gbps link. The switch is configured as a pure L2 switch (with multiple VLANs); all routing is done through the firewalls under test. One VM on the Proxmox host runs the iperf3 server, while the host itself (on a different VLAN) as well as the separate desktop run iperf3 clients.
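As a rough sketch of this wiring (bridge and NIC names are placeholders, not copied from my actual configuration), the Proxmox host side boils down to a VLAN-aware bridge in /etc/network/interfaces that trunks all tagged VLANs towards the switch:

    auto vmbr0
    iface vmbr0 inet manual
        bridge-ports enp1s0          # physical NIC, tagged trunk to the Ethernet switch
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes        # the bridge carries multiple tagged VLANs
        bridge-vids 2-4094

Each VM NIC is then attached to vmbr0 with its VLAN tag, so the guests only see untagged traffic on their virtio interfaces.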
The three systems to compare are:
| | Turris Omnia | APU4d4 OPNsense | VM OPNsense |
|---|---|---|---|
| CPU | Marvell Armada 380/385 | AMD GX-412TC | Intel Celeron G3900 |
| CPU speed | 1.6 GHz | 1 GHz (1.4 GHz boost) | 2.8 GHz |
| CPU cores | 2 | 4 | 2 |
| RAM | 2GB | 4GB | 4GB |
| NIC | builtin | Intel i211AT | virtio (vhost_net) / Intel 82599 |
| OS version | 5.1.10 (Linux kernel 4.14.222) | 21.1.4 (FreeBSD 12.1-RELEASE-p15-HBSD) | OPNsense 21.1 (FreeBSD 12.1-RELEASE-p12-HBSD) |
| Power usage (under load) | 14-16W | 9-14W | marginal (the host is running anyway) |
OPNsense on the APU4d4 has the recommended settings from here and here applied. OPNsense inside the VM has the NIC hardware offload features disabled and the VM configured with the recommended settings from here, as well as all VLANs terminated on the Linux kernel host side and bridged into the VM as independent virtual network interfaces (the consensus seems to be that VLAN tag handling is faster on Linux than on BSD in such a virtualized setting).
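As a minimal sketch of what such a VM definition can look like (the VM ID, MAC addresses, VLAN tags and the cpu setting are placeholder assumptions, not my exact values), each netX line terminates one VLAN on the host bridge and hands it to OPNsense as a plain virtio interface:

    # /etc/pve/qemu-server/<vmid>.conf (excerpt)
    cores: 2
    memory: 4096
    cpu: host                                            # pass the host CPU model through
    net0: virtio=AA:BB:CC:DD:EE:01,bridge=vmbr0,tag=10   # WAN VLAN, untagged inside the VM
    net1: virtio=AA:BB:CC:DD:EE:02,bridge=vmbr0,tag=20   # LAN VLAN
    net2: virtio=AA:BB:CC:DD:EE:03,bridge=vmbr0,tag=30   # DMZ VLAN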
Results
First I took a baseline measurement with an iperf3 client running on the VM host itself, connecting to an iperf3 server running within a Debian 10 VM without any of these test systems in the routing path, but simply over a virtio network connection on a single VLAN / IP subnet. The limit was CPU bound, as my Proxmox host (with a low-power CPU) ran at around 90% load over both cores during this baseline test.
All measurements were taken with iperf3 in TCP mode with 1 or 4 parallel streams and a test duration of 20 seconds:
iperf3 -c <server IP> -P <number of streams> -t 20
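Spelled out for two of the combinations (the addresses are placeholders), the server and client invocations look like this; -P selects the number of parallel streams and -6 forces the IPv6 run:

    # on the iperf3 server VM
    iperf3 -s

    # IPv4, 4 parallel streams, 20 seconds
    iperf3 -c 192.0.2.10 -P 4 -t 20

    # IPv6, single stream, 20 seconds
    iperf3 -6 -c 2001:db8::10 -P 1 -t 20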
| Baseline | Average throughput (retry packets) |
|---|---|
| IPv4 1 stream | 4.47 Gbps (273 retries) |
| IPv6 1 stream | 4.25 Gbps (229 retries) |
| IPv4 4 streams | 4.45 Gbps (4233 retries) |
| IPv6 4 streams | 4.48 Gbps (5247 retries) |
Measuring from the VM host to the VM (but on different VLANs, forcing traffic to be routed through the firewall under test), first without IPsec active (all transfer rates in Mbps):
| VM->VM no IPsec | Turris Omnia | APU4d4 OPNsense | VM OPNsense |
|---|---|---|---|
| IPv4 1 stream | 695 (1485 retr) | 493 (1095 retr) | 1090 (148 retr) |
| IPv6 1 stream | 422 (1327 retr) | 341 (719 retr) | 714 (156 retr) |
| IPv4 4 streams | 732 (10981 retr) | 736 (12236 retr) | 1140 (1993 retr) |
| IPv6 4 streams | 415 (6570 retr) | 629 (10386 retr) | 793 (997 retr) |
Note that the VM->VM measurements, when going through the VM OPNsense instance, are all on the same physical host and therefore not bound by any hardware network limits, but only by CPU and the efficiency of the 3 network stacks involved. It is also notable that IPv6 throughput was consistently lower than IPv4 throughput in this scenario, even though the two were nearly equal in the baseline.
Then with two IPsec tunnels to external sites configured and up/routed, but with the traffic under test explicitly not being routed through the tunnels. That is, the tunnel policies are loaded in the kernel, but the test traffic should not match any of them (see the policy check sketched after the table below). As the results show, there is nonetheless a very clear performance impact on OPNsense when we hit CPU limits (I only measured 4 streams, as we already know that single-stream performance is limited on OPNsense with low-power CPUs):
| VM->VM with IPsec | Turris Omnia | APU4d4 OPNsense | VM OPNsense |
|---|---|---|---|
| IPv4 4 streams | 689 (10369 retr) | 551 (9520 retr) | 869 (1526 retr) |
| IPv6 4 streams | 405 (5854 retr) | 413 (4911 retr) | 799 (430 retr) |
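As a quick sanity check that the benchmark traffic really is not matched by any of the loaded policies, the kernel security policy database can be dumped on both platforms and its selectors compared against the test subnets (the subnets named here are placeholders; the Linux side assumes iproute2 with xfrm support and policy-based strongSwan tunnels):

    # FreeBSD / OPNsense: dump the security policy database
    setkey -DP

    # Linux / OpenWRT: dump the XFRM policies
    ip xfrm policy show

    # none of the listed selectors should cover the iperf3 test subnets,
    # e.g. 192.0.2.0/24 or 2001:db8:1::/64 in this sketch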
Repeating the measurements from a physically separate client: the test traffic goes through a physical switch to the (physical or virtual) firewall under test, then (for the two physical firewalls, not the VM one) back through the same switch (on a different VLAN) to the Proxmox host, where the VLAN tagged traffic is bridged into the VM running the iperf3 server:
| Desktop->VM no IPsec | Turris Omnia | APU4d4 OPNsense | VM OPNsense |
|---|---|---|---|
| IPv4 1 stream | 771 (669 retr) | 525 (62 retr) | 737 (15 retr) |
| IPv6 1 stream | 435 (586 retr) | 401 (38 retr) | 653 (9 retr) |
| IPv4 4 streams | 777 (1805 retr) | 867 (302 retr) | 784 (661 retr) |
| IPv6 4 streams | 413 (873 retr) | 663 (662 retr) | 699 (513 retr) |
| Desktop->VM with IPsec | Turris Omnia | APU4d4 OPNsense | VM OPNsense |
|---|---|---|---|
| IPv4 4 streams | 737 (1577 retr) | 562 (304 retr) | 710 (505 retr) |
| IPv6 4 streams | 402 (1584 retr) | 585 (206 retr) | 728 (220 retr) |
Conclusions
For standard routing and firewalling of multiple parallel streams, OPNsense on a low-power APU4d4 system performs a bit better (noticeably better with IPv6) than a Turris Omnia, at a slightly lower electrical power draw under load. OPNsense has the advantage of a much nicer UI for firewall rules (including the possibility to define host objects and groups spanning IPv4 and IPv6), more control in terms of monitoring the firewall, nicely integrated modules like VPN protocols, and the beginnings of an API for automated configuration. Pretty much all of that can also be done with OpenWRT, but mostly on the shell or through a wide variety of config files. None of these physical systems reach full Gbps firewalling speed like the even lower powered Mikrotik systems with RouterOS and Fasttrack do.
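To give an impression of that API (the key/secret pair and the hostname are placeholders; the firmware status endpoint serves as the usual first example), status and configuration can be queried over HTTPS with an API key:

    # query the firmware status through the OPNsense REST API
    curl -k -u "API_KEY:API_SECRET" https://opnsense.example.org/api/core/firmware/status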
However, there are currently a few areas of concern:
- Single-stream performance is worse, and this is a known problem for FreeBSD kernels. For a single stream (e.g. uploads/downloads to/from a local file server), OPNsense (both physical on the APU4d4 and virtual on a power-efficient server CPU) is limited to about half the maximum Ethernet throughput. This may or may not be relevant to your use case.
- When IPsec is active - even if the relevant traffic is not part of the IPsec policy - throughput decreases by nearly a third. This looks like a real performance issue / bug in the FreeBSD/HardenedBSD kernel. I will need to try VTI based IPsec routing to see whether the in-kernel policy matching is the problem (a rough sketch follows after this list).
- These tests intentionally deactivated some of the interesting OPNsense features such as traffic analysis with samplicate/flowd_aggregate. Enabling them costs another 150-200 Mbps of throughput on the APU4d4, stacked on top of the IPsec performance drop if all are enabled.
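For that planned VTI follow-up, this is a minimal sketch of route-based IPsec with if_ipsec(4) on FreeBSD (addresses, networks and the reqid are placeholders; on OPNsense this would be configured through the GUI rather than by hand):

    # create a virtual tunnel interface tied to the IPsec SAs with reqid 100
    ifconfig ipsec0 create reqid 100
    ifconfig ipsec0 inet tunnel 198.51.100.1 203.0.113.1   # outer (gateway) addresses
    ifconfig ipsec0 inet 10.0.0.1/30 10.0.0.2              # inner point-to-point addresses
    # traffic is selected by normal routing instead of kernel SPD matching
    route add -net 172.16.0.0/16 10.0.0.2

The interesting question is whether moving traffic selection from SPD lookups to the routing table avoids the throughput penalty observed above.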