Update 4/8/2017: This issue is actually caused by an underlying memory allocation deadlock bug in the ib_mthca kernel driver. The only true fix is to buy new hardware which does not use that driver and instead uses mlx4_ib. The following is left up for reference.

For the last 6 months, I’ve had the most frustrating issue with my 20Gbps Infiniband home network. Copying files from my file server would ultimately cause NFS to lock up for an indeterminate period of time. During that time, which could be anywhere from 3 seconds to 3 hours, there was no traffic on the IB link whatsoever. I spent several weekends chasing red herrings – everything from hardware issues like bad cables and overheating HCAs, to needless kernel driver changes and firmware upgrades. From my testing, I determined some interesting properties of the failure mode:

  • The failure was not limited to NFS; it also happened with Samba (SMB/CIFS) and would even affect iperf.
  • When the connection was stalled, the IB link itself still worked. I could SSH over it just fine.
  • The issue only affected the outgoing direction of iperf from my workstation. It could receive fine, just not send.
  • When NFS stalled, it took 200% CPU. When anything else stalled, it used no extra CPU.
  • The last packet sent from the client was always an ACK, and TCP keepalive packets were sent every 60 seconds while the connection hung (see the capture sketch below). The transfer(s) were otherwise normal, the only exception being the huge delay between packets. (I’m using TCP rather than RDMA due to a kernel bug affecting 3.15 and earlier.)
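
For the curious, the capture that showed the ACKs and keepalives was nothing more elaborate than tcpdump on the IPoIB interface. A minimal sketch, assuming the interface is named ib0 and NFS is on its standard TCP port 2049 (adjust both for your setup):

```bash
# Watch the NFS TCP stream on the IPoIB interface during a stall.
# "ib0" and port 2049 are assumptions; substitute your interface and service port.
sudo tcpdump -i ib0 -nn 'tcp port 2049'
```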

I was convinced that this was a kernel bug, and my suspicions were confirmed, but not in the way I’d expected. It turns out that the issue is triggered by the Linux kernel’s swap daemon, kswapd. I noticed that every time NFS locked up, CPU usage would spike to 200% across two processes: one was a kworker thread, the other was kswapd.

Wait a second. I don’t have any swap.

This immediately struck me as odd. What would the kernel be trying to swap? I was only using 50% of my 16GiB of RAM, so there shouldn’t have been any OOM pressure. A quick Google search returned some interesting info: apparently this has been a known issue for a while, and according to some, kswapd is in need of a rewrite for modern systems.

From what I understand, data is cached in RAM during the file transfer(s) as usual, and as the transfers progress, the RAM slowly fills to capacity with cached data. I’m also running ZFS, which doesn’t help matters, as its ARC will try to claim up to 7/8ths of the available memory for itself. At some point, kswapd scans through all of that memory looking for pages to evict. This scan pegs a core at 100% CPU and blocks NFS while it does… something?
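
You can watch this happening by observing the page cache and ARC grow while a transfer runs. A rough sketch, assuming ZFS on Linux (which exposes ARC statistics under /proc/spl/kstat/zfs/arcstats):

```bash
# Watch free memory, page cache, and ZFS ARC size during a transfer.
# The arcstats path assumes ZFS on Linux; adjust if your setup differs.
while true; do
    grep -E '^(MemFree|Cached):' /proc/meminfo
    awk '$1 == "size" { print "ARC size (bytes):", $3 }' /proc/spl/kstat/zfs/arcstats
    echo '---'
    sleep 5
done
```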

This leaves the question of what we can do. The short answer is that there isn’t a proper fix at the time of writing, but there are a few workarounds.

The first is to set vm.swappiness = 0 to reduce the likelihood that kswapd will want to ruin your day. This change will not solve the issue on its own, but it does seem (at least in my case) to reduce the severity. You can set the parameter on the fly with sysctl, and make it permanent by adding it to /etc/sysctl.conf.
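
Both forms of the change look like this (the tee invocation is just one way to append the line as root):

```bash
# Apply immediately; this does not persist across reboots.
sudo sysctl -w vm.swappiness=0

# Persist the setting across reboots.
echo 'vm.swappiness=0' | sudo tee -a /etc/sysctl.conf
```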

The real workaround is to clear the page cache completely. This frees up a massive amount of memory and immediately leaves kswapd with nothing to do. You can clear the cache by running, as root: echo 3 > /proc/sys/vm/drop_caches. My solution was to create a bash script that runs that command once every 10 minutes. In my case, that is enough to prevent kswapd from ever kicking in and to keep my NFS transfers running. I just start the script before doing intensive transfers and kill it when I don’t need it, since constantly purging the cache does increase overall I/O.
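
A minimal sketch of such a script (it must run as root; the sync beforehand flushes dirty pages so more of the cache can actually be dropped, and the 10-minute interval is simply what works for my workload):

```bash
#!/bin/bash
# Periodically drop the page cache, dentries, and inodes so kswapd
# never has a huge cache to scan. Must be run as root.
while true; do
    sync                                # flush dirty pages first
    echo 3 > /proc/sys/vm/drop_caches   # 3 = page cache + dentries and inodes
    sleep 600                           # wait 10 minutes
done
```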

I hope that this issue gets addressed soon in the mainline kernel.