GlusterFS Tcp_nodelay Patch Update

As mentioned in my previous post, I wrote a patch for GlusterFS to increase its performance when operating on many smaller files. Someone told me the other day that this functionality has been pushed to the git repository. Would have been good to have heard about this sooner…

So, all of you who emailed me positive feedback and asked me to make it a tuneable in the translator config (thanks!) - please check out the above link to the git repository.

On another note, it seems as if they’re breaking away from having the protocol version bound to the release version - good progress in my opinion!

Jun 29th, 2009

Improving GlusterFS Performance

I’ve had a closer look at GlusterFS in the last few days following the release of version 2.0.1. We often get customers approaching us with web apps dealing with user-generated content that needs to be uploaded. If you have two or more servers in a load-balanced environment, you usually have a few options: an NFS/CIFS share on one of them (single point of failure - failover NFS is, well…), a SAN (expensive), MogileFS (good, but alas not application agnostic), periodically rsync/tar | nc files between the nodes (messy, not application agnostic and slow), or storing files in a database (not ideal for a number of reasons). There are a few other approaches and combinations of the above, but none of them is perfect. GlusterFS solves this. It’s fast, instant and redundant!

I’ve got four machines set up, two of them acting as redundant servers. Since they’re effectively acting as a RAID 1, each write is done twice over the wire, but that’s kind of inevitable. They’re all connected to a private, isolated gigabit network. When dealing with larger files (a la cp yourfavouritedistro.iso /mnt/gluster) the throughput is really good, at around 20-25 MB/s leaving the client. CPU usage on the client doing the copy was in the realm of 20-25% on a dual core. Very good so far!

Then I tried many frequent filesystem operations by untarring the 2.6.9 Linux kernel from and onto the mount. Not so brilliant! It took 23-24 minutes from start to finish. The 2.6.9 kernel contains 17477 files with an average size of just a few kilobytes, so this translates into a lot of small bursts of network traffic!
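
For reference, the test itself is nothing fancy - roughly something like this, with the tarball sitting on the mount as well:

time tar xjf /mnt/gluster/linux-2.6.9.tar.bz2 -C /mnt/gluster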

After seeing this, I dove into the source code to have a look. When I reached the socket code, I realised that performance for smaller files would probably improve by a lot if Nagle’s algorithm was disabled on the socket. Said and done: I added a few setsockopt() calls and went to test. The kernel tree now extracted in 1m 20s!
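
If you want to verify that Nagle’s algorithm really does get disabled after patching, one rough way (assuming a single glusterfsd process, and that a client reconnects while you’re watching) is to trace the setsockopt() calls - strace decodes the option name, so TCP_NODELAY shows up in the output:

strace -f -e trace=setsockopt -p "$(pidof glusterfsd)" 2>&1 | grep TCP_NODELAY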

Of course there’s always a drawback.. In this case it’s that larger files take longer to transfer, as the raw throughput decreases (the kernel buffer is a lot faster than a cat5 cable!). Copying a 620 MB ISO from local disk onto the mount takes 1m 20s with the vanilla version of GlusterFS, and 3m 34s with Nagle’s algorithm disabled.

I’m not seeing any performance hit on sustained transfers of larger files, but at the moment I’m guessing I’m hitting another bottleneck before that becomes a problem, as disabling Nagle’s algorithm “in theory” should have a slight negative impact in this case.

If you want to have a look at it, you can find the patch here. Just download it to the source directory, run patch -p1 < glusterfs-2.0.1-patch-erik.diff, and then proceed to build as normal.
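
In full, assuming the usual autotools dance and default prefix, that looks something like this:

# cd glusterfs-2.0.1
# patch -p1 < glusterfs-2.0.1-patch-erik.diff
# ./configure
# make && make install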

Until I’ve done some more testing on it and received some feedback, I won’t bother making it a tuneable in the vol-file just in case it’d be wasted effort!

Jun 4th, 2009

Don't Fix, Work Around - MySQL

I attended the MySQL EMEA conference last Thursday, where I enjoyed a talk from Ivan Zoratti titled “Scaling Up, Scaling Out, Virtualization – What should you do with MySQL?”

They have changed their minds quite a bit. Virtualisation in production is no longer a solid no-no according to them (a lot of people would argue). Solaris containers, anyone?

As most of us know by now, MySQL struggles to utilise multiple cores efficiently. This has been the case for quite some time, and people like Google and Percona have grown tired of waiting for MySQL to fix it.

Sun decided not to go down the route of reviewing and accepting those patches, and are instead suggesting – are you sitting down? – running multiple instances on the same hardware. I’m not against this from a technical point of view, as it currently does improve performance on multiple-core, multiple-disk systems (for an unpatched version) for some workloads, but the fact that they openly and officially suggest workarounds for their own problem rather than fixing its source is disturbing.

Granted, I suppose it makes sense to suggest larger boxes if you’ve been bought by a big-iron manufacturer. To be fair, though, Ivan at least didn’t say scaling out was a negative thing - it’s still a good option.

If anyone asks me though, I think I’ll keep scaling outwards and use the more sensible version of MySQL.

Oct 26th, 2008

Flush Bash_history After Each Command

If you, like me, often work in a lot of terminals on a lot of servers, or even a lot of terminals on the same one, you may recognise the frustration of a lost bash history. I don’t always gracefully log out of my sessions, so every so often my ~/.bash_history isn’t written and all my flashy commands are lost (the history buffer is only committed when you log out; what you see in history is not necessarily written to disk yet). I quite often find myself retyping the same one-liners or long option lists just because I closed my konsole or SecureCRT window without first logging out of all the sessions properly.

So I put some effort into finding a solution to this, and whilst reading through the bash manpage, I saw PROMPT_COMMAND. Pling!

export PROMPT_COMMAND='history -a'

To quote the manpage: “If set, the value is executed as a command prior to issuing each primary prompt.” So every time my command has finished, bash appends the unwritten history items to ~/.bash_history before displaying the primary prompt ($PS1) again.

So after putting that line in /etc/bashrc, I don’t have to find myself reinventing wheels or losing valuable seconds re-typing stuff just because I was lazy with my terminals.
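
One caveat: if your distribution already sets PROMPT_COMMAND (some use it to update the terminal title), append to it rather than overwrite it. Something along these lines should do:

export PROMPT_COMMAND="history -a${PROMPT_COMMAND:+; $PROMPT_COMMAND}"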

This is one of those things that I should have done ages ago, but never took the time to.

Oct 5th, 2008

Multiple Backends With Varnish

Varnish has been able to provide caching for more than one backend for quite some time. The Achilles heel has up until now been that it hasn’t been able to determine whether a backend is healthy or not. This is now a problem of the past! The backend health polling code is available in 2.0 beta1. Sadly it had a bug: when using the ‘random’ director, it was unable to use the remaining healthy backend if all but one went MIA. I reported this bug and it was fixed in changeset r3174.

So as of now, you can safely use one Varnish instance in front of several backend nodes, thus eliminating double-caching (memory waste, unnecessary load on the backends), reducing network traffic, doing rudimentary load balancing, easing management, etc. With the obscene amount of traffic Varnish can push without putting a fairly basic system under any load worth mentioning, a single front-end can serve several nodes in most setups.

Here’s an elementary sample VCL for how to do this:

backend node0 {
  .host = "127.0.0.1";
  .port = "80";
  .probe = {
    .url = "/";
    .timeout = 50 ms;
    .interval = 1s;
    .window = 10;
    .threshold = 8;
  }
}

backend node1 {
  .host = "10.0.0.2";
  .port = "80";
  .probe = {
#   .url = "/";
    .timeout = 100 ms;
    .interval = 1s;
    .window = 10;
    .threshold = 8;
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: 10.0.0.2"
      "Connection: close"
      "Accept-Encoding: foo/bar";
  }
}

director cl1 random {
  { .backend = node0; .weight = 1; }
  { .backend = node1; .weight = 1; }
}

# director cl1 round-robin {
#   { .backend = node1; }
#   { .backend = node0; }
# }

sub vcl_recv {
  set req.backend = cl1;
}


As you can see, I’m defining the two backends slightly differently. You need to define one of .url or .request, but not both, for obvious reasons. If you go for the slightly simpler .url, the default request looks like this:

GET / HTTP/1.1
Host: something
Connection: close


If this does not suit your needs, comment out .url and use .request to roll your own. This aspect of Varnish is actually quite well documented, so I won’t repeat what’s on the Trac page.

There is clearly a lot more you can and, more often than not, should do in the VCL than the above. This is a stripped down version which only pertains to the backend polling functionality.

Sep 15th, 2008

Btrfs - Filesystem to End All Filesystems

There is some good stuff on the horizon! It’s called btrfs (“butter-fs”). It was originally announced/“released” over a year ago by our friends at Oracle and has, in my opinion, not quite received the attention it deserves. I’m keeping a close eye on its very intensive development, as the feature list is very interesting from several angles. It’s got some of the big names behind it and will undoubtedly be widely deployed and accepted into the vanilla kernel once stable.

btrfs, like ZFS, implements a copy-on-write model, so yes – it will be able to do snapshots! Writeable ones at that. In fact, it’s got the ability to do snapshots of snapshots - a quasi-MVCC filesystem! COW unfortunately makes a filesystem more prone to fragmentation, but luckily btrfs comes with online defragmentation and filesystem check abilities. The speed of read and write operations will obviously be impaired while those run, but there are always ways around that in most performance-sensitive setups! If not, there should be! Sadly, COW isn’t that good a choice for database workloads. But fret not, COW can be disabled with a mount option (-o nodatacow). This doesn’t mean you lose the snapshot ability, as btrfs ignores this option if a data extent is referenced by more than one snapshot, so COW will, as far as I understand, be enabled from the moment you initiate a snapshot and stay that way until you’re done with it.
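
As a rough sketch of what that looks like (the device and mount point are made up, and the snapshot command assumes a current btrfs-progs):

# mount with copy-on-write disabled for ordinary data
mount -o nodatacow /dev/sdb1 /mnt/butter
# snapshots still work; extents shared with a snapshot are COW'd regardless of the option
btrfs subvolume snapshot /mnt/butter /mnt/butter/snap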

Early benchmarks show that btrfs is extremely fast at writing and a little poorer at reading. It will be interesting to see how these numbers change as development proceeds, and whether added features will have any negative impact on performance. As a side note – I was quite surprised to see the poor numbers for ext3 in these benchmarks!

So if you’re a DBA and your data fits in memory, this filesystem will be right up your alley. With a reasonable number of tables and some proper values for innodb_open_files and table_cache, I wouldn’t expect any remarkable difference in day-to-day database operation, since the real bottleneck is usually in the hardware. Generally speaking, of course. I’m sure there are workloads out there which will benefit a lot more than “the norm”. Likewise, people with awkward, read-heavy setups with a lot of data in a lot of files may well be better off not using btrfs. If you, like myself, often use blinks of an eye as a unit, you know what I’m talking about.
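
For reference, those two knobs live in my.cnf; the values below are purely illustrative, not recommendations:

[mysqld]
# number of .ibd files InnoDB keeps open at any one time
innodb_open_files = 300
# cache of open table handlers shared by all threads
table_cache       = 512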

Yet another interesting piece of functionality built in is the multiple-device support. I won’t call it a substitute for proper hardware-based RAID, but it could well be one for LVM (bearing the snapshots in mind as well)!

Another thing worth keeping an eye on is a related project, CRFS, which may turn out to be a worthy NFS replacement. While it’s planned to get failover capabilities, I would much rather have seen a client-agnostic, MogileFS-style implementation.

Sadly, they are not production ready yet. Not by far. But it’s something to look forward to. I’ll give it a version or two before I put it under the microscope and chuck some real-world load onto it. Can’t wait!

Sep 4th, 2008

Tooltip - Inotify-tools

A nifty tool which might come in handy when getting to know a system or tracking down I/O usage is inotify-tools. It’s a lightweight interface to the kernel’s inotify functionality. It gives a quick overview of which files are accessed, and how, in any given directory and its subdirectories (if asked to). That quickly tells you about caching efficiency, frequency of commits, and all sorts of useful things for a whole range of applications.

Here’s some example output from the inotifywatch utility during a mysqlslap run:

# inotifywatch  /var/lib/mysql/mysqlslap/*
Establishing watches...
Finished establishing watches, now collecting statistics.
total  access  modify  close_write  close_nowrite  open  delete_self  filename
728    450     274     1            0              1     1            /var/lib/mysql/mysqlslap/t1.MYD
241    0       238     1            0              0     1            /var/lib/mysql/mysqlslap/t1.MYI
5      1       0       0            1              1     1            /var/lib/mysql/mysqlslap/t1.frm
2      0       0       0            0              0     1            /var/lib/mysql/mysqlslap/db.opt
.

Granted - as far as MySQL is concerned - most of this information is accessible through the SHOW GLOBAL STATUS and/or SHOW ENGINE INNODB STATUS commands. But if you, for instance, have an irrational fear of the key_read_requests variable, you could always look at how often your MYI files are accessed. You catch my drift..

If you’re only interested in certain file operations, you can apply filters. If, for instance, you only care about file writes, your run would look like this:

# inotifywatch -e modify -e delete_self /var/lib/mysql/mysqlslap/*
Establishing watches...
Finished establishing watches, now collecting statistics.
total  modify  delete_self  filename
49     47      1            /var/lib/mysql/mysqlslap/t1.MYI
44     42      1            /var/lib/mysql/mysqlslap/t1.MYD
2      0       1            /var/lib/mysql/mysqlslap/db.opt
2      0       1            /var/lib/mysql/mysqlslap/t1.frm
.

You can monitor pretty much any file operation, so this tool can be used in a whole range of scenarios. Ever wondered just how many temp files your application creates? Are you sure it doesn’t open and close the file handle for each operation? Do you want to know which file on your website is the most popular download right now, but can’t wait until the webstats crontab has run? I could go on…

inotify-tools comes with another utility, inotifywait. This tool watches for activity on a specified file or directory and instantly tells you which operation was performed. Nothing amazing, but I can see a few areas of use for that as well, though most of them are covered by dedicated tools already.
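
A quick example (the path is obviously made up): watching a download directory and printing every completed write as it happens.

# -m keeps watching instead of exiting after the first event,
# -r recurses into subdirectories, --format trims the output down
inotifywait -m -r -e close_write --format '%w%f' /var/www/downloads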

Aug 26th, 2008

Lighttpd 2.0

Linked from a post on the lighttpd blog is a page outlining the plans for lighttpd 2.0.

While I’ve experienced some of the “oddities” they refer to, I have every confidence in the developers. Even so - it’s a risky path to go down. They will most likely iron out the current shortcomings and oddities, but it’s fairly likely that a few new ones will be introduced along the way. I do however believe the planned use of the well-proven glib is likely to prevent some of them.

It will be mighty interesting to see the impact of using libev for managing events. It certainly helps when “flying light”. Another interesting thing will be to see what plugins people come up with!

Graceful restarts will be most welcome as well!

Go go lighttpd!

Aug 2nd, 2008

Some Trickery or Resilience With Varnish

As of now, Varnish has no means of detecting whether a backend is available or in good health before sending a request to it (periodic checking is scheduled for version 2.0 and will presumably work with the cluster mode as well). So if you’ve got two or more backends, and one of them under some condition can’t or won’t serve a request immediately, or you want to send the request elsewhere depending on some circumstance, you can do this based on the HTTP return code or a header, using the not-so-well-documented ‘restart’ feature (then again, what feature is well documented in Varnish?).

‘restart’ will effectively increase a counter by 1 and re-run vcl_recv(). You can set how many times a restart should take place before giving up entirely - should you not use the counter in a condition before it reaches the limit - by starting varnishd with -p max_restarts=n, or with param.set max_restarts n on the CLI. This parameter defaults to 4, and you can of course set conditions depending on the number of restarts.
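
For example (addresses and paths here are made up for illustration):

# at startup
varnishd -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -p max_restarts=2

# or at runtime, typed on the management CLI (the one listening on the -T address)
param.set max_restarts 2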

Here’s a sample VCL to do this:

backend be1 {
  .host = "127.0.0.1";
  .port = "81";
}

backend be2 {
  .host = "10.0.0.2";
  .port = "81";
}

sub vcl_recv {
  if (req.restarts == 0) {
    set req.backend = be1;
  } else if (req.restarts == 1) {
    set req.backend = be2;
  }
}

sub vcl_fetch {
  if (obj.status != 200 && obj.status != 302) {
    restart;
  }
}

In this simple VCL, a request destined for this Varnish instance which doesn’t get a 200 or 302 back from the backend is effectively sent to 10.0.0.2, which may have something else in store for the visitor!

If I, for instance, use the above VCL, set be1 to return a 301 for /, and send a request to Varnish, this is what shows up in varnishlog:

...
   10 ObjProtocol  c HTTP/1.1
   10 ObjStatus    c 301
   10 ObjResponse  c Moved Permanently
   10 ObjHeader    c Date: Tue, 22 Jul 2008 00:25:29 GMT
   10 ObjHeader    c Server: Apache/2.0.59 (CentOS)
   10 ObjHeader    c X-Powered-By: PHP/5.1.6
   10 ObjHeader    c Location: http://be1.northernmost.org:6081/links.php/
   10 ObjHeader    c Content-Type: text/html; charset=UTF-8
   13 BackendClose b be1
   10 TTL          c 1839681264 RFC 120 1216686329 1216686329 0 0 0
   10 VCL_call     c fetch
   10 VCL_return   c restart
   10 VCL_call     c recv
   10 VCL_return   c lookup
   10 VCL_call     c hash
   10 VCL_return   c hash
   10 VCL_call     c miss
   10 VCL_return   c fetch
   12 BackendClose b be2
   12 BackendOpen  b be2 10.0.0.1 38478 10.0.0.2 81
   12 TxRequest    b GET
   12 TxURL        b /
   12 TxProtocol   b HTTP/1.1
...
   10 ObjProtocol  c HTTP/1.1
   10 ObjStatus    c 200
   10 ObjResponse  c OK
   10 ObjHeader    c Date: Mon, 21 Jul 2008 23:37:24 GMT
   10 ObjHeader    c Server: Apache/2.2.6 (FreeBSD) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2
   10 ObjHeader    c Last-Modified: Thu, 10 Jul 2008 14:26:46 GMT
   10 ObjHeader    c ETag: "35e801-3-3702d580"
   10 ObjHeader    c Content-Type: text/html
   12 BackendReuse b be2
...

You can of course use this for very basic resilience as well, but that’s really a job for your load balancer. Also be aware of the overhead involved, since the request is after all sent to the first backend and processed before being passed on to the other node.

Maybe it’s not the most useful feature in the world, but I thought it was nifty!

Jul 22nd, 2008

LVM With Dmraid

When adding a new disk for a customer running CentOS 4.7 on severely old hardware, I bumped into something that had never happened to me before. The system simply wouldn’t let me create the Physical Volume and gave me this message:

[root@gwyneth ~]# pvcreate /dev/hdc1
Can't open /dev/hdc1 exclusively.  Mounted filesystem?

hdc1 was obviously not mounted or in any other way in use. Or so I thought. As I was flicking through the loaded kernel modules, I saw the dm_* modules being loaded and was quite sure I knew what it was at that point: dmraid was hogging the disk. I verified that dmraid was indeed aware of the new disk with:

[root@gwyneth ~]# dmraid -r
/dev/hdc: pdc, "pdc_hceidaeha", mirror, ok, 78125000 sectors, data@ 0

So, deactivating the (in)appropriate RAID set did the trick:

[root@gwyneth ~]# dmraid -a no pdc
[root@gwyneth ~]# pvcreate /dev/hdc1
Physical volume "/dev/hdc1" successfully created

Obviously, if you don’t currently run, and don’t plan on ever running, any software RAID setup (which you shouldn’t on a server, in my opinion), you’re best off removing the dmraid package altogether to avoid the same or similar problems in the future.
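
On a CentOS box that boils down to something like this (after double-checking that nothing you rely on is sitting on a dmraid-managed device):

# see what dmraid currently knows about / has activated
dmraid -s
# then get rid of the package
yum remove dmraid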

Jul 15th, 2008