Improving GlusterFS Performance

I’ve had a closer look at GlusterFS in the last few days following the release of version 2.0.1. We often get customers approaching us with web apps that deal with user-generated content which needs to be uploaded. If you have two or more servers in a load-balanced environment, you usually have a few options: an NFS/CIFS share on one of them (single point of failure; failover NFS is, well…), a SAN (expensive), MogileFS (good, but alas not application agnostic), periodically rsync/tar | nc’ing files between the nodes (messy, not application agnostic and slow), or storing files in a database (not ideal for a number of reasons). There are a few other approaches and combinations of the above, but none of them is perfect. GlusterFS solves this. It’s fast, instant and redundant!

I’ve got four machines set up, two of them acting as redundant servers. Since they’re effectively acting as a RAID 1, each write is done twice over the wire, but that’s kind of inevitable. They’re all connected over a private, isolated gigabit network. When dealing with larger files (à la cp yourfavouritedistro.iso /mnt/gluster) the throughput is really good, at around 20-25 MB/s leaving the client. CPU usage on the client doing the copy was in the realm of 20-25% on a dual core. Very good so far!

Then I tried lots of frequent, small filesystem operations: untarring the 2.6.9 Linux kernel, reading the tarball from and extracting onto the mount. Not so brilliant! It took 23-24 minutes from start to finish. The 2.6.9 kernel contains 17,477 files with an average size of just a few kilobytes, so this obviously generates a lot of small bursts of network traffic!

After seeing this, I dove into the source code to have a look. When I reached the socket code, I realised that performance for smaller files would probably improve by a lot if Nagle’s algorithm was disabled on the socket. No sooner said than done: I added a few setsockopt() calls and went to test. The kernel tree now extracted in 1m 20s!
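
For the curious, the change boils down to something like the sketch below. This is a minimal illustration of the technique rather than the patch itself; the disable_nagle() helper is hypothetical, and the real setsockopt() calls sit inside GlusterFS’s socket transport code, where the details differ:

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on an already-connected TCP socket.
 * Illustrative helper only; the patch adds equivalent setsockopt()
 * calls inside GlusterFS's socket transport. */
static int
disable_nagle (int sockfd)
{
        int on = 1;

        /* With Nagle enabled, small writes are held back and coalesced until
         * the previous segment has been ACKed. TCP_NODELAY pushes each write
         * out immediately, which suits the many tiny request/response round
         * trips generated by thousands of small files. */
        if (setsockopt (sockfd, IPPROTO_TCP, TCP_NODELAY,
                        &on, sizeof (on)) == -1) {
                fprintf (stderr, "setsockopt (TCP_NODELAY): %s\n",
                         strerror (errno));
                return -1;
        }

        return 0;
}
```

Conceptually, something like this would be called right after the transport connects or accepts a socket, so every small GlusterFS request goes on the wire straight away instead of waiting in the kernel’s send buffer.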

Of course, there’s always a drawback… In this case it’s that larger files take longer to transfer, as the raw throughput decreases (the kernel buffer is a lot faster than a cat5 cable!). Copying a 620 MB ISO from local disk onto the mount takes 1m 20s with the vanilla version of GlusterFS, and 3m 34s with Nagle’s algorithm disabled.

I’m not seeing any performance hit on sustained transfers of larger files, but at the moment I’m guessing I’m hitting another bottleneck before that becomes a problem, as disabling Nagle’s algorithm should, in theory, have a slight negative impact in that case too.

If you want to have a look at it, you can find the patch here. Just download it to the source directory, run patch -p1 < glusterfs-2.0.1-patch-erik.diff and then proceed to build as normal.

Until I’ve done some more testing on it and received some feedback, I won’t bother making it a tunable in the vol file, in case it turns out to be wasted effort!

Jun 4th, 2009