<![CDATA[All things sysadmin]]> 2015-10-30T18:41:21+00:00 http://northernmost.org/blog// Octopress <![CDATA[sar rebooting ubuntu]]> 2015-10-30T17:27:45+00:00 http://northernmost.org/blog//sar-rebooting-ubuntu/sar-rebooting-ubuntu Today I had a colleague approach me about a one-liner I sent him many months ago, saying that it kept rebooting a server he was running it on.

It was little more than running sar in a loop, extracting some values and running another command if certain thresholds were exceeded. Hardly anything you’d think would result in a reboot.
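The one-liner itself is beside the point, but for illustration it was something along these lines (a sketch; the threshold and the action taken are made up):

#!/bin/bash
# Hypothetical reconstruction - not the actual one-liner.
# Sample disk I/O with sar and react if the average tps gets too high.
while true; do
    # 'sar -b 1 5': five one-second samples; $2 of the Average line is tps
    tps=$(sar -b 1 5 | awk '/^Average/ { print int($2) }')
    if [ "${tps:-0}" -gt 1000 ]; then
        logger "tps is ${tps}, doing something about it"
    fi
done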

After whittling down the oneliner to the offending command, it turned out that sar was the culprit. Some further debugging revealed that sar merely spawns a process called sadc, which does the actual heavy lifting.

In certain circumstances, if you send SIGINT (ctrl+c, for example) to sar, it can exit before sadc has done its thing.
When that happens, sadc becomes an orphan, and /sbin/init, being a good little init system, takes it under its wing and becomes its parent process.

When sadc receives the SIGINT signal, its signal handler will pass it up to its parent process… You see where this is going, right?
Yep, /sbin/init gets the signal and does what it’s supposed to do when sent SIGINT: initiate a reboot.

If you want to reboot an Ubuntu 14.x server, simply run this in a terminal (as root, this is NOT a DoS/vulnerability, merely a bug):

root@elcrashtest:~# echo $(sar -b 1 5)
^C
root@elcrashtest:~# ^C
root@elcrashtest:~#
Broadcast message from root@elcrashtest
    (unknown) at 18:06 ...

    The system is going down for reboot NOW!
    Control-Alt-Delete pressed

Rapidly hitting ctrl+c twice does the trick.
Obviously this command doesn’t make sense to run in isolation, but the bug was hit in the context of a more involved one-liner, and being in a subprocess seems to trigger it more often. You may need to run it a couple of times, as a few things need to line up for it to happen. The above command reboots the server roughly 8-9 times out of 10.

If executed in another subshell, you only need to hit ctrl+c once to trigger it.

A more unrealistic, but sure-fire way to trigger it looks like this:

root@elcrashtest:~# sar -b 1 100 > /dev/null &
[1] 3777
root@elcrashtest:~# kill -SIGKILL $! ; kill -SIGINT $(pidof sadc);
Broadcast message from root@elcrashtest
...

Basically, this kills sar forcefully (thus orphaning sadc) and then sends SIGINT to sadc. It has a 100% success rate.

This was fixed upstream in 2014, but Canonical has neglected to backport it.
A colleague of mine, who is a much better OSS citizen than I am, has raised this with Canonical.

I only tested this on Ubuntu 14.04 and 14.10. Debian and RedHat/CentOS do not appear to suffer from this. It’s surprising that it’s still present in Ubuntu Trusty, since the fix is backported in Debian Jessie.

Only on a Friday afternoon…

]]>
<![CDATA[GRE tunnels and UFW]]> 2015-09-14T19:17:10+01:00 http://northernmost.org/blog//gre-tunnels-and-ufw/gre-tunnels-and-ufw Today I wrote an Ansible playbook to set up an environment for a docker demo I will be giving shortly. In the demo I will be using three hosts, and I want the containers to be able to speak to each other across hosts. To this end, I’m using Open vSwitch. The setup is quite straightforward: set up the bridge, get the meshed GRE tunnels up, and off you go.
I first set this up in a lab, with firewalls disabled. But knowing that I will give the demo on public infrastructure, I still wrote the play to allow everything on a particular interface (an isolated cloud-network) through UFW.
When I ran my playbook against a few cloud servers, the containers couldn’t talk to each other on account of the GRE tunnels not working.

So I enabled logging in UFW, and soon started seeing entries like this:

[UFW BLOCK] IN=eth2 OUT= MAC=<redacted>
SRC=<redacted> DST=<redacted> LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=36639 DF
PROTO=47

Upon checking which rule actually dropped the packets (iptables -L -nv), it transpired that the culprit was:

1    97 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0
ctstate INVALID

It turns out that a change in kernel 3.18 and onwards means that unless either the nf_conntrack_pptp or the nf_conntrack_proto_gre module is loaded, any GRE packets will be marked as INVALID, as opposed to NEW and subsequently ESTABLISHED.

So in order to get Open vSwitch working with UFW, there are two solutions: either explicitly allow protocol 47 (GRE), or load one of the aforementioned kernel modules.

Should you go for the former solution, this is the rule you need to beat to the punch:

$ grep -A 2 "drop INVALID" /etc/ufw/before.rules
# drop INVALID packets (logs these in loglevel medium and higher)
-A ufw-before-input -m conntrack --ctstate INVALID -j ufw-logging-deny
-A ufw-before-input -m conntrack --ctstate INVALID -j DROP

You beat it to the punch by placing -A ufw-before-input -p 47 -i $iface -j ACCEPT above those lines in /etc/ufw/before.rules.
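Should you prefer the latter solution, it’s a one-liner per module (shown for nf_conntrack_proto_gre; persisting it this way assumes the Debian/Ubuntu /etc/modules convention):

# Track GRE flows so conntrack marks them NEW/ESTABLISHED rather than INVALID
modprobe nf_conntrack_proto_gre
# Make it survive a reboot
echo nf_conntrack_proto_gre >> /etc/modules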

]]>
<![CDATA[LVM thinpool for docker storage on Fedora 22]]> 2015-09-08T10:31:58+01:00 http://northernmost.org/blog//lvm-thinpool-for-docker-storage-on-fedora-22/lvm-thinpool-for-docker-storage-on-fedora-22 TL;DR: You can use docker-storage-setup without the root fs being on LVM by passing the DEVS and VG environment variables to the script, or by editing /etc/sysconfig/docker-storage-setup

I stumbled across this article the other day: ‘Friends Don’t Let Friends Run Docker on Loopback in Production’.

I also saw this bug being raised, saying docker-storage-setup doesn’t work with the Fedora 22 cloud image, as the root fs isn’t on LVM.

I decided to try this out, so I created some block storage and a Fedora 22 VM on the Rackspace cloud:

$ cinder create --display-name docker-storage --volume-type 1fd376b5-c84e-43c5-a66b-d895cb75ac2c 75
# Verify that it's built and is available
$ cinder show 359b01b7-541c-4f4d-b2e7-279d778079a4
# Build a Fedora 22 server with the volume attached
nova boot --image 2cc5db1b-2fc8-42ae-8afb-d30c68037f02 \
--flavor performance1-1 \
--block-device-mapping xvdb=359b01b7-541c-4f4d-b2e7-279d778079a4 \
docker-storage-test

Once on the machine, I followed the article above:

$ dnf -y install docker
$ systemctl stop docker
$ rm -rf /var/lib/docker/

And here’s where the bug report I linked earlier comes into play. docker-storage-setup is just a bash script, and if you go by this output:

docker-storage-setup --help
Usage: /usr/bin/docker-storage-setup [OPTIONS]

Grows the root filesystem and sets up storage for docker.

Options:
  -h, --help            Print help message.

It sure gives the impression of doing only one single thing - growing the root FS! As the bug rightly points out, the Fedora cloud image doesn’t come with LVM for the root FS (which is a good thing!), so there’s no VG for this script to grow.

So unless you read the script, or the manpage, you wouldn’t necessarily notice that what --help says is just the default behaviour, and that you can use docker-storage-setup to use an ephemeral disk and leave the root fs alone. The kicker lies in two environment variables (as opposed to arguments to the script itself, as is more common): $DEVS and $VG. If you supply both of those, and the disk you give in DEVS has no partition table and the VG you supply doesn’t exist, the script will partition the disk and create all the necessary bits for LVM on that disk:

# Verify that ephemeral disk has no partition table:
$ partx -s /dev/xvdb
partx: /dev/xvdb: failed to read partition table

# Start lvmetad
$ systemctl start lvm2-lvmetad

$ DEVS="/dev/xvdb" VG="docker-data" docker-storage-setup
  Volume group "xvda1" not found
  Cannot process volume group xvda1
Checking that no-one is using this disk right now ... OK

Disk /dev/xvdb: 75 GiB, 80530636800 bytes, 157286400 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0x2b7ebb69.
Created a new partition 1 of type 'Linux LVM' and of size 75 GiB.
/dev/xvdb2:
New situation:

Device     Boot Start       End   Sectors Size Id Type
/dev/xvdb1       2048 157286399 157284352  75G 8e Linux LVM

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
  Physical volume "/dev/xvdb1" successfully created
  Volume group "docker-data" successfully created
  Rounding up size to full physical extent 80.00 MiB
  Logical volume "docker-poolmeta" created.
  Logical volume "docker-pool" created.
  WARNING: Converting logical volume docker-data/docker-pool and docker-data/docker-poolmeta to pool's data and metadata volumes.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted docker-data/docker-pool to thin pool.
  Logical volume "docker-pool" changed.

# Verify that the script wrote the docker-storage file
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS=--storage-driver devicemapper --storage-opt dm.fs=xfs
--storage-opt dm.thinpooldev=/dev/mapper/docker--data-docker--pool
--storage-opt dm.use_deferred_removal=true

# Verify that the LV is there:
$ lvs
  LV          VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker-data twi-a-t--- 44.95g             0.00   0.07

So now the script has created the LV thinpool, and written the required docker configuration.
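Incidentally, if you’d rather not pass environment variables on the command line, the same values can be persisted in the file the script sources on startup (a sketch mirroring the run above):

# /etc/sysconfig/docker-storage-setup
DEVS=/dev/xvdb
VG=docker-data

Anyway - with the storage in place, time to start docker and see what it makes of it: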

$ systemctl start docker
$ docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker--data-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: extfs
 Data file:
 Metadata file:
 Data Space Used: 19.92 MB
 Data Space Total: 48.26 GB
 Data Space Available: 48.24 GB
 Metadata Space Used: 65.54 kB
 Metadata Space Total: 83.89 MB
 Metadata Space Available: 83.82 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.0.8-300.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 987.8 MiB
Name: docker-storage-test
ID: EYKV:Q5D6:4F3Y:Z5X3:ZILX:ZBVI:2YF6:VHD7:RFQS:IWWO:MOFL:EWO7

No trace of /dev/loop0! And to verify that it’s actually using our thinpool:

$ lvdisplay | egrep "Allocated pool data" ; du -sh /var/lib/docker/ ; docker pull centos:6 ; du -sh /var/lib/docker ; lvdisplay | egrep "Allocated pool data"
  Allocated pool data    0.04%
5.6M    total
6: Pulling from docker.io/centos
47d44cb6f252: Pull complete
6a7b54515901: Pull complete
e788880c8cfa: Pull complete
1debf8fb53e6: Pull complete
72703a0520b7: Already exists
docker.io/centos:6: The image you are pulling has been verified. Important: image verification is a tech preview feature and should not be relied on to provide security.
Digest: sha256:5436a8b20d6cdf638d936ce1486e277294f6a1360a7b630b9ef76b30d9a88aec
Status: Downloaded newer image for docker.io/centos:6
5.8M    total
  Allocated pool data    0.53%

In conclusion - the script could definitely do with being updated to use command line arguments for this rather than environment variables, and with the --help output updated to highlight this.

]]>
<![CDATA[readdir and directories on xfs]]> 2015-01-16T13:34:46+00:00 http://northernmost.org/blog//readdir-and-directories-on-xfs/readdir-and-directories-on-xfs Recently I had some pretty unexpected results from a piece of code I wrote quite a while ago, and never had any issues with. I ran my program on a brand new CentOS 7 installation, and the results weren’t at all what I was used to!

Consider the following code (abridged and simplified):

readdir_xfs.c
#include <stdio.h>
#include <string.h>    /* strncmp */
#include <limits.h>    /* PATH_MAX */
#include <dirent.h>
#include <sys/types.h>

void recursive_dir(const char *path){

  DIR *dir;
  struct dirent *de;

  if (!(dir = opendir(path))){
    perror("opendir");
    return;
  }
  if (!(de = readdir(dir))){
    perror("readdir");
    return;
  }

  do {

    if (strncmp (de->d_name, ".", 1) == 0 || strncmp (de->d_name, "..", 2) == 0) {
      continue;
    }

    if (de->d_type == DT_DIR){
      char full_path[PATH_MAX];
      snprintf(full_path, PATH_MAX, "%s/%s", path, de->d_name);
      printf("Dir: %s\n", full_path);
      recursive_dir(full_path);
    }
    else {
      printf("\tFile: %s%s\n", path, de->d_name);
    }
  } while ((de = readdir(dir)) != NULL);
  closedir(dir);

}

int main(int argc, char *argv[]){

  if (argc < 2){
    fprintf(stderr, "Usage: %s <dir>\n", argv[0]);
    return 1;
  }

  recursive_dir(argv[1]);
  return 0;
}

Pretty straightforward - it reads directories, printing them and the files within them. Now here’s the kicker:

$ gcc -g dirtraverse.c -o dirtraverse && ./dirtraverse /data_ext4/
Dir: /data_ext4//dir1
        File: /data_ext4//dir1file3
        File: /data_ext4//dir1file1
        File: /data_ext4//dir1file2
Dir: /data_ext4//dir2
        File: /data_ext4//dir2file1
Dir: /data_ext4//dir3
$ rsync -a --delete /data_ext4/ /data_xfs/  # Ensure directories are identical
$ gcc -g dirtraverse.c -o dirtraverse && ./dirtraverse /data_xfs/
        File: /data_xfs/dir1
        File: /data_xfs/dir2
        File: /data_xfs/dir3

No traversal?

After a bit of head scratching, and a few debug statements, I found that when using readdir(3) on XFS, dirent->d_type is always 0 (DT_UNKNOWN), no matter what type of file it is. This means that the de->d_type == DT_DIR check can never be true.

To be fair though, the manpage states that POSIX only mandates dirent->d_name.

So to be absolutely sure your directory traversal code is portable, make use of stat(2) and the S_ISDIR() macro whenever d_type comes back as DT_UNKNOWN!
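(The same principle applies at the shell level, incidentally: test -d asks stat(2) rather than trusting directory entries, so a quick loop like this behaves identically on ext4 and XFS:)

# [ -d ] stats each entry instead of reading d_type, so it can't be fooled
for entry in /data_xfs/*; do
    if [ -d "$entry" ]; then
        echo "Dir: $entry"
    else
        echo "File: $entry"
    fi
done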

]]>
<![CDATA[How does MySQL hide the command line password in ps?]]> 2012-03-10T05:03:46+00:00 http://northernmost.org/blog//how-does-mysql-hide-the-command-line-password-in-ps/how-does-mysql-hide-the-command-line-password-in-ps I saw this question asked today, and thought I’d write a quick post about it. Giving passwords on the command line isn’t necessarily a fantastic idea - but you can sort of see where they’re coming from. Configuration files and environment variables are better, but just slightly. Security is a nightmare!

But if you do decide to write an application which takes a password (or any other sensitive information) on the command line, you can prevent other users on the system from easily seeing it like this:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>

int main(int argc, char *argv[]){

    int i = 0;
    pid_t mypid = getpid();
    if (argc == 1)
        return 1;
    printf("argc = %d and arguments are:\n", argc);
    for (i = 0; i < argc; i++)
        printf("%d = %s\n", i, argv[i]);
    printf("Replacing first argument with x:es... Now open another terminal and run: ps p %d\n", (int)mypid);
    fflush(stdout);
    memset(argv[1], 'x', strlen(argv[1]));
    getc(stdin);
    return 0;
}

A sample run looks like this:

$ ./pwhide abcd
argc = 2 and arguments are:
0 = ./pwhide
1 = abcd
Replacing first argument with x:es... Now run: ps p 27913

In another terminal:

$ ps p 27913
  PID TTY      STAT   TIME COMMAND
27913 pts/1    S+     0:00 ./pwhide xxxx

In the interest of brevity, the above code isn’t very portable - but it works on Linux, and hopefully the point of it comes across. In other environments, such as FreeBSD, you have the setproctitle() library function to do the dirty work for you. The key thing here is the overwriting of argv[1]. Because the size of argv[] is allocated when the program starts, you can’t easily obfuscate the length of the password. I say easily - because of course there is a way.
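Incidentally, on Linux ps reads the argument vector straight out of /proc, so you can see exactly what it sees (using the PID from the sample run above):

# /proc/<pid>/cmdline is the live argv, NUL-separated
$ tr '\0' ' ' < /proc/27913/cmdline
./pwhide xxxx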

]]>
<![CDATA[Font rendering - no more jealousy]]> 2012-02-28T20:02:17+00:00 http://northernmost.org/blog//font-rendering-no-more-jealousy/font-rendering-no-more-jealousy I suppose this kind of content is what most people use twitter for these days. But since I’ve remained strong and stayed well away from that, I suppose I will have to be a tad retro and write a short blog post about it. If you, like me, are an avid Fedora user, I’m sure you’ve thrown glances at colleagues’ or friends’ Ubuntu machines and thought that there was something slightly different about the way they looked (aside from the obvious Gnome vs Unity differences). Shinier, somehow… So had I, but I mainly dismissed it as a case of “the grass is always greener…”.

It turns out that the grass actually IS greener.

Tonight I stumbled upon this. It’s a patched version of freetype. For what I assume are political reasons (free as in speech), Fedora ships a Freetype version without subpixel rendering. These patches fix that and other things.

With a default configuration file of 407 lines, it’s quite extensible and configurable as well. Luckily, I quite like the default!

If you’re not entirely happy with the way your fonts look on Fedora, it’s well worth a look.

]]>
<![CDATA[Transactions and code testing]]> 2011-08-18T13:08:29+01:00 http://northernmost.org/blog//transactions-and-code-testing/transactions-and-code-testing A little while ago I worked with a customer to migrate their DB from MyISAM to InnoDB (something I definitely don’t mind doing!). As part of the testing, I set up a smaller test instance with all tables using the InnoDB engine. I instructed them to thoroughly test their application against this test instance and let me know if they identified any issues.

They reported back that everything seemed fine, and we went off to do the actual migration. Everything went according to plan and things seemed well. After a while they started seeing some discrepancies in the stock portion of their application. The data didn’t add up with what they expected and stock levels seemed surprisingly high. A crontabbed program was responsible for periodically updating the stock count of products, so this was of course the first place I looked. I ran it manually and looked at its output; it was very verbose and reported some 2000 products had been updated. But looking at the actual DB, this was far from the case.

Still having the test environment available, I ran it a few times against that, and could see the com_update and com_insert counters being incremented, so I knew the queries were making it there. But the data remained intact. At this point I had a gut feeling about what was going on, so to confirm it I enabled query logging. It didn’t take me long to spot the problem; on the second line of the log, I saw this:

       40 Query set autocommit=0

The program responsible for updating the stock levels was a Python script using MySQLdb. I couldn’t see any trace of autocommit being set explicitly, so I went on assuming that it was off by default (which turned out to be correct). After adding a connection.commit() once the relevant queries had been sent to the server, everything was back to normal as far as stock levels were concerned. Since the code itself was seeing its own transaction, calls such as cursor.rowcount, which the testers had relied on, were all correct.

But the lesson here: when testing your software from a database point of view, don’t blindly trust what your code tells you it’s done; make sure it has actually done it by verifying the data! A lot of things can happen to data between your program and the platters. Its transaction can deadlock and be rolled back, it can be reading cached data, it can get lost in a crashing message queue, etc.

As a rule of thumb, I’m rather against setting a blanket autocommit=1 in code; I’ve seen that come back to haunt developers in the past. I’m a strong advocate of explicit transaction handling.
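At the SQL level, explicit handling needn’t be anything more involved than this (a sketch; the database, table and column names are made up):

# Explicit transaction via the mysql CLI; nothing is visible to other
# sessions (or durable) until the COMMIT goes through
mysql stockdb <<'SQL'
SET autocommit=0;
START TRANSACTION;
UPDATE products SET stock_count = stock_count - 1 WHERE product_id = 42;
COMMIT;
SQL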

]]>
<![CDATA[Find out what is using your swap]]> 2011-05-27T16:46:40+01:00 http://northernmost.org/blog//find-out-what-is-using-your-swap/find-out-what-is-using-your-swap Have you ever logged in to a server, run free, seen that a bit of swap is used, and wondered what’s in there? Knowing what’s in there usually isn’t very indicative of anything, or even overly helpful; mostly it’s a curiosity thing.

Either way, starting from kernel 2.6.16, we can find out using smaps, which can be found in the proc filesystem. I’ve written a simple bash script which prints out all running processes and their swap usage. It’s quick and dirty, but it does the job and can easily be modified to work on any info exposed in /proc/$PID/smaps. If I find the time and inspiration, I might tidy it up and extend it a bit to cover some more alternatives. The output is in kilobytes.
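Each mapping in smaps carries its own Swap: figure, which is what the script below tallies up per process (sample lines from a random process; your numbers will differ):

$ grep Swap /proc/self/smaps | head -3
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB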

#!/bin/bash
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
SUM=0
OVERALL=0
for DIR in `find /proc/ -maxdepth 1 -type d | egrep "^/proc/[0-9]"` ; do
        PID=`echo $DIR | cut -d / -f 3`
        PROGNAME=`ps -p $PID -o comm --no-headers`
        for SWAP in `grep Swap $DIR/smaps 2>/dev/null| awk '{ print $2 }'`
        do
                let SUM=$SUM+$SWAP
        done
        echo "PID=$PID - Swap used: $SUM - ($PROGNAME )"
        let OVERALL=$OVERALL+$SUM
        SUM=0

done
echo "Overall swap used: $OVERALL"

This will need to be run as root for it to gather accurate numbers. It will still work if you don’t, but it will report 0 for any processes not owned by your user. Needless to say, it’s Linux only. The output is ordered alphabetically according to your locale (which admittedly isn’t a great thing since we’re dealing with numbers), but you can easily apply your standard shell magic to the output. For instance, to find the process with the most swap used, just run the script like so:

$ ./getswap.sh | sort -n -k 5

Don’t want to see stuff that’s not using swap at all?

$ ./getswap.sh  | egrep -v "Swap used: 0" |sort -n -k 5

… and so on and so forth.

]]>
<![CDATA[Example using Cassandra with Thrift in C++]]> 2011-05-21T20:09:46+01:00 http://northernmost.org/blog//example-using-cassandra-with-thrift-in-c-plus-plus/example-using-cassandra-with-thrift-in-c-plus-plus Due to a very exciting, recently launched project at work, I’ve had to interface with Cassandra through C++ code. As anyone who has done this can testify, the API docs are vague at best, and there are very few examples out there. The constant API changes between 0.x versions don’t help either, nor does the fact that the Cassandra API has its docs and Thrift has its own, with nothing bridging the two. So at the moment it is very much a case of dissecting header files and looking at the implementations in the Thrift-generated source files.

The only somewhat useful example of using Cassandra with C++ one can find online is this, but due to the API changes, this is now outdated (it’s still worth a read).

So in the hope that nobody else will have to spend the better part of a day piecing things together to achieve even the most basic thing, here’s an example which works with Cassandra 0.7 and Thrift 0.6.

First of all, create a new keyspace and a column family, using cassandra-cli:

[default@unknown] create keyspace nm_example;
c647b2c0-83e2-11e0-9eb2-e700f669bcfc
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown] use nm_example;
Authenticated to keyspace: nm_example
[default@nm_example] create column family nm_cfamily with comparator=BytesType and default_validation_class=BytesType;
30466721-83e3-11e0-9eb2-e700f669bcfc
Waiting for schema agreement...
... schemas agree across the cluster
[default@nm_example]

Now go to the directory where you have Cassandra installed, enter the interface/ directory and run: thrift -gen cpp cassandra.thrift. This will create the gen-cpp/ directory. From this directory, you need to copy all files bar Cassandra_server.skeleton.cpp to wherever you intend to keep your sources. Here’s some example code which inserts, retrieves, updates, retrieves and deletes keys:

#include "Cassandra.h"

#include <protocol/TBinaryProtocol.h>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TTransportUtils.h>

using namespace std;
using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace org::apache::cassandra;
using namespace boost;

static string host("127.0.0.1");
static int port= 9160;

int64_t getTS(){
    /* If you're doing things quickly, you may want to make use of tv_usec
     * or something here instead
     */
    time_t ltime;
    ltime=time(NULL);
    return (int64_t)ltime;

}

int main(){
    shared_ptr<TTransport> socket(new TSocket(host, port));
    shared_ptr<TTransport> transport(new TFramedTransport(socket));
    shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
    CassandraClient client(protocol);

    const string& key="your_key";

    ColumnPath cpath;
    ColumnParent cp;

    ColumnOrSuperColumn csc;
    Column c;

    c.name.assign("column_name");
    c.value.assign("Data for our key to go into column_name");
    c.timestamp = getTS();
    c.ttl = 300;

    cp.column_family.assign("nm_cfamily");
    cp.super_column.assign("");

    cpath.column_family.assign("nm_cfamily");
    /* This is required - thrift 'feature' */
    cpath.__isset.column = true;
    cpath.column="column_name";

    try {
        transport->open();
        cout << "Set keyspace to 'dpdns'.." << endl;
        client.set_keyspace("nm_example");

        cout << "Insert key '" << key << "' in column '" << c.name << "' in column family '" << cp.column_family << "' with timestamp " << c.timestamp << "..." << endl;
        client.insert(key, cp, c, org::apache::cassandra::ConsistencyLevel::ONE);

        cout << "Retrieve key '" << key << "' from column '" << cpath.column << "' in column family '" << cpath.column_family << "' again..." << endl;
        client.get(csc, key, cpath, org::apache::cassandra::ConsistencyLevel::ONE);
        cout << "Value read is '" << csc.column.value << "'..." << endl;

        c.timestamp++;
        c.value.assign("Updated data going into column_name");
        cout << "Update key '" << key << "' in column with timestamp " << c.timestamp << "..." << endl;
        client.insert(key, cp, c, org::apache::cassandra::ConsistencyLevel::ONE);

        cout << "Retrieve updated key '" << key << "' from column '" << cpath.column << "' in column family '" << cpath.column_family << "' again..." << endl;
        client.get(csc, key, cpath, org::apache::cassandra::ConsistencyLevel::ONE);
        cout << "Updated value is: '" << csc.column.value << "'" << endl;

        cout << "Remove the key '" << key << "' we just retrieved. Value '" << csc.column.value << "' timestamp " << csc.column.timestamp << " ..." << endl;
        client.remove(key, cpath, csc.column.timestamp, org::apache::cassandra::ConsistencyLevel::ONE);

        transport->close();
    }
    catch (NotFoundException &nf){
        cerr << "NotFoundException ERROR: "<< nf.what() << endl;
    }
    catch (InvalidRequestException &re) {
        cerr << "InvalidRequest ERROR: " << re.why << endl;
    }
    catch (TException &tx) {
        cerr << "TException ERROR: " << tx.what() << endl;
    }

    return 0;
}

Say we’ve called the file cassandra_example.cpp and you have the files mentioned above in the same directory; you can then compile things like this:

$ g++ -lthrift -Wall  cassandra_example.cpp cassandra_constants.cpp Cassandra.cpp cassandra_types.cpp -o cassandra_example
$ ./cassandra_example
Set keyspace to 'nm_example'..
Insert key 'your_key' in column 'column_name' in column family 'nm_cfamily' with timestamp 1306008338...
Retrieve key 'your_key' from column 'column_name' in column family 'nm_cfamily' again...
Value read is 'Data for our key to go into column_name'...
Update key 'your_key' in column with timestamp 1306008339...
Retrieve updated key 'your_key' from column 'column_name' in column family 'nm_cfamily' again...
Updated value is: 'Updated data going into column_name'
Remove the key 'your_key' we just retrieved. Value 'Updated data going into column_name' timestamp 1306008339 ...

Another thing worth mentioning is Padraig O'Sullivan’s libcassandra, which may or may not be worth a look depending on what you want to do and what versions of Thrift and Cassandra you’re tied to.

]]>
<![CDATA[Site slow after scaling out? Yeah, possibly!]]> 2011-03-29T06:03:46+01:00 http://northernmost.org/blog//site-slow-after-scaling-out-yeah-possibly/site-slow-after-scaling-out-yeah-possibly Every now and then, we have customers who outgrow their single server setup. The next natural step is of course splitting the web layer from the DB layer. So they get another server, and move the database to that.

So far so good! A week or so later, we often get the call “Our page load time is higher now than before the upgrade! We’ve got twice as much hardware, and it’s slower! You have broken it!” It’s easy to see where they’re coming from. It makes sense, right?

That is until you factor in the newly introduced network topology! Today it’s not unusual (that’s not to say it’s acceptable or optimal) for your average wordpress/drupal/joomla/otherspawnofsatan site to run 40-50 queries per page load. Quite often even more!

Based on a tcpdump session of a reasonably average query (if there is such a thing), connecting to a server, authenticating, sending a query and receiving a 5-row result set of 1434 bytes yields 25 packets sent between my laptop and a remote DB server on the same wired, non-congested network. A normal, average latency for TCP/IP over Ethernet is ~0.2 ms for packets of the size we’re talking about here. So, doing the maths, you’re looking at 25 packets × 0.2 ms × 50 queries = 250 ms of pure network latency per page load for your SQL queries. This is obviously a lot more than you’d see over a local UNIX socket.

This is inevitable - laws of physics. There is nothing you, your sysadmin and/or your hosting company can do about it. There may however be something your developer can do about the amount of queries! You also shouldn’t confuse response times with availability. Your response times may be slower, but you can (hopefully) serve a lot more users with this setup!

Sure, there are technologies out there which have considerably less latency than ethernet, but they come with quite the price-tag, and there are more often than not quite a few avenues to go down before it makes sense to start looking at that kind of thing.

You could also potentially look at running the full stack on both machines, using master/master replication for your DBs, load balancing your front-ends, and having both read locally but only write to one node at a time! That kind of DB scenario is fairly easily set up using mmm for MySQL. But in my experience, this often ends up more costly and introduces more complexity than it solves. I’m an avid advocate for keeping server roles separate as much as possible!

]]>
<![CDATA[A look at mysql-5-5 semi-synchronous replication]]> 2010-10-09T20:19:54+01:00 http://northernmost.org/blog//a-look-at-mysql-5-5-semi-synchronous-replication/a-look-at-mysql-5-5-semi-synchronous-replication Now that MySQL 5.5 is in RC, I decided to have a look at the semi-synchronous replication. It’s easy to get going, and from my very initial tests it appears to work a treat.

This mode of replication is called semi-synchronous due to the fact that it only guarantees that at least one of the slaves has written the transaction to disk in its relay log, not actually committed it to its data files. It guarantees that the data exists by some means somewhere, but not that it’s retrievable through a MySQL client.

Semi sync is available as a plugin, and if you compile from source, you’ll need to do --with-plugins=semisync. So far, the semisync plugin can only be built as a dynamic module, so you’ll need to install it once you’ve got your instance up and running. To do this, you do as with any other plugin:

install plugin rpl_semi_sync_master soname 'semisync_master.so';
install plugin rpl_semi_sync_slave soname 'semisync_slave.so';

You might get a 1126 error and a message saying “Can’t open shared library..”, in which case you most likely need to set the plugin_dir variable in my.cnf and give MySQL a restart. If you’re using a master/slave pair, you obviously won’t need to load both modules as above; you load the slave one on your slave, and the master one on your master. Once you’ve done this, you’ll have entries for these modules in the mysql.plugin table. When you have confirmed that you do, you can safely add the pertinent variables to your my.cnf. The values I used (in addition to the normal replication settings) for my master/master sandboxes were:

plugin_dir=/opt/mysql-5.5.6-rc/lib/mysql/plugin/
rpl_semi_sync_master_enabled=1
rpl_semi_sync_master_timeout=10000
rpl_semi_sync_slave_enabled=1
rpl_semi_sync_master_trace_level=64
rpl_semi_sync_slave_trace_level=64
rpl_semi_sync_master_wait_no_slave=1

Note that you probably won’t want to use these values for _trace_level in production due to the verbosity in the log! I just enabled these while testing. Also note that the timeout is in milliseconds. You can also set these on the fly with SET GLOBAL (thanks Oracle!), just make sure the slave is stopped before doing this, as it needs to be enabled during the handshake with the master for the semisync to kick in.

The timeout is the amount of time the master will lock and wait for a slave to acknowledge the write before giving up on the whole idea of semi synchronous operation and continue as normal. If you want to monitor this, you can use the status variable Rpl_semi_sync_master_status which is set to Off when this happens. If this condition should be avoided altogether, you would need to set a large enough value for the timeout and a low enough monitoring threshold as there doesn’t seem to be a way to force MySQL to wait forever for a slave to appear.
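A minimal check for a monitoring script might look like this (a sketch):

# 'ON' means at least one semisync slave is acknowledging writes;
# 'OFF' means the master has timed out and fallen back to asynchronous
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status'"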

If you’re running an automated failover setup, you’ll want to set the timeout higher than your heartbeat, thus ensuring no committed data is lost. Then you might also want to set the timeout considerably lower initially on the passive master, so that you don’t end up waiting on the master we know is unhealthy and have just failed away from.

Before implementing this in production, I would strongly recommend running a few performance tests against your setup, as this will slow things down considerably for some workloads. Each transaction has to be written to the binlog, read over the wire and written to the relay log, and then lastly flushed to disk before each DML statement returns. You will almost certainly benefit from batching up queries into larger transactions rather than using the default autocommit mode, as this reduces how often those steps have to happen. Update: Even though the manual clearly states that the event has to be flushed to disk, this doesn’t actually appear to be the case (see comments). The above still stands, but the impact may not be as great as first thought.

When I find the time, I will run some benchmarks on this.

Lastly, please note that this is written while MySQL 5.5 is still in release candidate stage, so while unlikely, things are subject to change. So please be mindful of this in future comments.

]]>
<![CDATA[GlusterFS init script and Puppet]]> 2010-08-09T08:08:14+01:00 http://northernmost.org/blog//glusterfs-init-script-and-puppet/glusterfs-init-script-and-puppet The other day I had quite the head scratcher. I was setting up a new environment for a customer which included the usual suspects in a LAMP stack spread across a few virtual machines in an ESXi cluster. As the project is quite volatile in terms of requirements, amount of servers, server roles, location etc. I decided to start off using Puppet to make my life easier further down the road.

I got most of it set up, and got started on writing up the glusterfs Puppet module. Fairly straightforward: a few directories, configuration files and a mount point. Then I came to the Service declaration, and of course we want this to be running at all times, so I went on and wrote:

service { "glusterfsd":
    ensure => running,
    enable => true,
    hasrestart => true,
    hasstatus => true,
}

expecting glusterfsd to be running shortly after I purposefully stopped it. But it wasn’t. So I dove into Puppet (Yay Ruby!) and deduced that the way it determines whether something is running or not is the return code of: /sbin/service servicename status

So a quick look in the init script which ships with glusterfs-server shows that it calls the stock init function “status” on glusterfsd, which is perfectly fine, but then it doesn’t exit with the return code from this function, it simply runs out of scope and exits with the default value of 0.

So to get around this, I made a quick change to the init script and used the return code from the “status” function (/etc/rc.d/init.d/functions on RHEL5) and exited with $?, and Puppet had glusterfsd running within minutes.
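The change amounts to something along these lines at the end of the status case (a sketch; the layout of the shipped script will differ):

# in /etc/init.d/glusterfsd
status)
        status glusterfsd   # from /etc/rc.d/init.d/functions
        RETVAL=$?
        ;;
esac
exit $RETVAL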

I couldn’t find anything when searching for this, so I thought I’d make a note of it here.

]]>
<![CDATA[Legitimate emails being dropped by Spamassassin in RHEL5]]> 2010-05-26T19:05:34+01:00 http://northernmost.org/blog//legitimate-emails-being-dropped-by-spamassassin-in-rhel5/legitimate-emails-being-dropped-by-spamassassin-in-rhel5 Over the past few months, an increasing number of customers have complained that their otherwise OK spam filters have started dropping an inordinate amount of legitimate emails. The first reaction is of course to increase the score required to be filtered, but that just opens up for more spam. I looked in the quarantine on one of these servers, and ran a few of the legitimate ones through spamassassin in debug mode. I noticed one particular rule which was prevalent in the vast majority of the emails. Here’s an example:

...
[2162] dbg: learn: initializing learner
[2162] dbg: check: is spam? score=4.004 required=6
[2162] dbg: check: tests=FH_DATE_PAST_20XX,HTML_MESSAGE,SPF_HELO_PASS
...

4 is obviously quite a high score for an email whose only flaw is being in HTML. But FH_DATE_PAST_20XX caught my eye in all of the outputs. So to the rule files:

$ grep FH_DATE_PAST_20XX /usr/share/spamassassin/72_active.cf
##{ FH_DATE_PAST_20XX
header   FH_DATE_PAST_20XX      Date =~ /20[1-9][0-9]/ [if-unset: 2006]
describe FH_DATE_PAST_20XX      The date is grossly in the future.
##} FH_DATE_PAST_20XX

Aha. This is a problem - ever since the 1st of January 2010, every legitimate Date: header matches that regex. With 50_scores.cf containing this:

$ grep FH_DATE_PAST /usr/share/spamassassin/50_scores.cf
score FH_DATE_PAST_20XX 2.075 3.384 3.554 3.188 # n=2

it’s no wonder emails are getting dropped! I guess this is a problem one can expect when running a distribution with packages 6 years old and neglecting to frequently (or at least every once in a while) update the rules!
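Until the packages catch up, you can either pull fresh rules or neutralise the offending one locally (a score of 0 disables a rule):

# The real fix - fetch current rules:
$ sa-update && service spamassassin restart
# Or zero out the rule in /etc/mail/spamassassin/local.cf:
score FH_DATE_PAST_20XX 0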

Luckily, this rule is gone altogether from RHEL6’s version of spamassassin.

]]>
<![CDATA[Control-groups in rhel6]]> 2010-05-13T09:26:51+01:00 http://northernmost.org/blog//control-groups-in-rhel6/control-groups-in-rhel6 One new feature that I’m very enthusiastic about in RHEL6 is Control Groups (cgroup for short). It allows you to create groups and allocate resources to these. You can then bunch your applications into groups to your heart’s content.

It’s relatively simple to set up, and configuration can be done in two different ways. You can use the supplied cgset command, or if you’re accustomed to doing it the usual way when dealing with kernel settings, you can simply echo values into the pseudo-files under the control group.

Here’s a control group in action:

[root@rhel6beta cgtest]# grep $$ /cgroup/gen/group1/tasks
1138
[root@rhel6beta cgtest]# cat /cgroup/gen/group1/memory.limit_in_bytes
536870912
[root@rhel6beta cgtest]# gcc alloc.c -o alloc && ./alloc
Allocating 642355200 bytes of RAM,,,
Killed
[root@rhel6beta cgtest]# echo `echo 1024*1024*1024| bc` >
/cgroup/gen/group1/memory.limit_in_bytes
[root@rhel6beta cgtest]# ./alloc
Allocating 642355200 bytes of RAM,,,
Successfully allocated 642355200 bytes of RAM, captn' Erik...
[root@rhel6beta cgtest]#

The first line shows that the shell which launches the app is under the control of the cgroup group1, so all its child processes are subject to the same restrictions.

As you can also see, the initial memory limit in the group is 512M. Alloc is a simple C app I wrote which calloc()s 612M of RAM (for demonstrative purposes, I’ve disabled swap on the system altogether). On the first run, the kernel kills the process in the same way it would if the whole system had run out of memory. The kernel message also indicates that the control group ran out of memory, and not the system as a whole:

...
May 13 17:56:20 rhel6beta kernel: Memory cgroup out of memory: kill process
1710 (alloc) score 9861 or a child
May 13 17:56:20 rhel6beta kernel: Killed process 1710 (alloc)

Unfortunately it doesn’t indicate which cgroup the process belonged to. Maybe it should?

cgroups don’t just give you the ability to limit the amount of RAM; there are a lot of tuneables. You can even set swappiness on a per-group basis! You can limit the devices applications are allowed to access, you can freeze processes, and you can tag outgoing network packets with a class ID in case you want to do shaping or profiling on your network! Perfect if you want to prioritise SSH traffic over anything else, so you can work comfortably even when your uplink is saturated. Furthermore, you can easily get an overview of memory usage, CPU accounting etc. for applications in any given group.

All this means you can clearly separate resources and, to quite a large extent, ensure that applications won’t starve the whole system, or each other, of resources. Very handy - no more waiting half an hour for the swap to fill up and the OOM killer to kick in (often choosing the wrong PID) when customers’ applications have run astray.

A much welcomed addition to RHEL!

]]>
<![CDATA[boot loader not installed in rhel6 beta]]> 2010-04-10T12:01:52+01:00 http://northernmost.org/blog//boot-loader-not-installed-in-rhel6-beta/boot-loader-not-installed-in-rhel6-beta Just a heads up I thought I’d share in the hope that it’ll save someone some time: when installing RHEL6 beta under Xen, be aware that pygrub currently can’t handle /boot being on ext4 (which is the default). So in order to run RHEL6 under Xen, ensure that you modify the partition layout during the installation process.

This turned out to be a real head scratcher for me, and initially I thought the problem was something else as Xen wasn’t being very helpful with error messages.

Hopefully there’ll be an update for this soon!

]]>
<![CDATA[building hiphop-php gotcha]]> 2010-02-21T11:17:51+00:00 http://northernmost.org/blog//building-hiphop-php-gotcha/building-hiphop-php-gotcha Tonight I’ve delved into the world of Facebook’s HipHop for PHP. Let me point out early on that I’m not doing so because I believe I will need it any time soon, but I am convinced that I will, without a shadow of a doubt, be approached by customers who think they do, and I’d rather not have opinions on, or advise against, things I haven’t tried myself or at least have a very good understanding of.

Unfortunately I set about this task on an RHEL 5.4 box, and it hasn’t been a walk in the park. Quite a few dependencies were out of date or didn’t exist in the repositories: libicu, boost, onig, tbb etc.

CMake did a good job of telling me what was wrong though, so it wasn’t a huge deal; I just compiled the missing pieces from source and put them in $CMAKE_PREFIX_PATH. One thing CMake didn’t pick up on, however, was that the flex version shipped with current RHEL is rather outdated. Once I thought I had everything configured, I set about the compilation, and my joy was swiftly cut short by this:

[  3%] [FLEX][XHPScanner] Building scanner with flex /usr/bin/flex version
2.5.4
/usr/bin/flex: unknown flag '-'.  For usage, try /usr/bin/flex --help

Not entirely sure what it was actually doing here, I took the shortcut of replacing /usr/bin/flex with a shell script which just exited after putting $@ in a file in /tmp/, and re-ran make. The shim was nothing fancier than this (a sketch; the exact file name doesn’t matter):
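#!/bin/sh
# Stand-in for /usr/bin/flex: record the arguments, then bail out
echo "$@" >> /tmp/flex-args.txt
exit 0

Looking in the resulting file, these are the arguments flex was given: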

-C --header-file=scanner.lex.hpp
-o/home/erik/dev/hiphop-php/src/third_party/xhp/xhp/scanner.lex.cpp
/home/erik/dev/hiphop-php/src/third_party/xhp/xhp/scanner.l

To me that looks quite valid, and there’s certainly no lone ‘-’ in that command line.

Long story short, flex introduced --header-file in a relatively “recent” version (2.5.33 it seems, but I may be wrong on that one; it doesn’t matter). Unlike most other programs (which use getopt), it won’t tell you Invalid option ‘--header-file’. So after compiling a newer version of flex, I was sailing again.

]]>
<![CDATA[Development; just as important as dual nics]]> 2010-02-13T17:38:56+00:00 http://northernmost.org/blog//development-just-as-important-as-dual-nics/development-just-as-important-as-dual-nics There is a popular saying which I find you can apply to most things in life; “You get what you pay for”. Sadly, this does not seem to apply for software development in any way. You who know me know that I work for a reasonably sized hosting company in the upper market segment. We have thousands of servers and hundreds of customers, so after a while you get a quite decent overview of how things work and a vast arsenal of “stories from the trenches”.

So here’s a small tip; ensure that your developers know what they are doing! It will save you a lot of hassle and money in the long run.

Without having made a science out of it, I can confidently say that at the very least 95% of the downtime I see on a daily basis is due to faulty code in the applications running on our servers.

So after you’ve demanded dual power feeds to your rack, bonded NICs and a gazillion physical paths to your dual controller SAN, it would make sense to apply the same attitude towards your developers. After all, they are carbon based humans and are far more likely to break than your silicon NIC. Now unfortunately it is not as simple as “if I pay someone a lot of money and let them do their thing, I will get good solid code out of it”, so a great deal of due diligence is required in this part of your environment as well. I have seen more plain stupid things coming from 50k pa. people than I care to mention, and I have seen plain brilliant things coming out of college kids’ basements.

This is important not only from an availability point of view, it’s also about running cost. The amount of hardware in our data centers which is completely redundant, and which could easily be made obsolete with a bit of code and database tweaking, is frightening. So you think you may have cut a great deal when someone said they could build your e-commerce system in 3 months for 10k less than other people quoted you. But in actual fact, all you have done is get someone to effectively re-brand a bloated, way too generic, stock framework/product which the developer has very little insight into and control over. Yes, it works if you “click here, there and then that button”; the right thing does appear on the screen. But only after executing hundreds of SQL queries, looking for your session in three different places, doing four HTTP redirects, reading five config files and including 45 other source files. Needless to say, that one-off 10k you think you have saved will be swallowed by recurring hardware cost in no time. You have probably also severely limited your ability to scale things up in the future.

So in summary, don’t cheap out on your development but at the same time don’t think that throwing money at people will make them write good code. Ask someone else to look things over every now and then, even if it will cost you a little bit. Use the budget you were planning on spending on the SEO consultant. Let it take time.

]]>
<![CDATA[GlusterFS tcp_nodelay patch update]]> 2009-06-29T06:06:34+01:00 http://northernmost.org/blog//glusterfs-tcp-nodelay-patch-update/glusterfs-tcp-nodelay-patch-update As mentioned in my previous post, I wrote a patch for GlusterFS to increase its performance when operating on many smaller files. Someone told me the other day that this functionality has been pushed to the git repository. Would have been good to have heard about this sooner…

So all of you who emailed me positive feedback and asked to make it a tuneable in the translator config (thanks!) - please check out the above link to the git repository.

On another note, it seems as if they’re breaking away from having the protocol version bound to the release version, good progress in my opinion!

]]>
<![CDATA[Improving GlusterFS performance]]> 2009-06-04T20:06:08+01:00 http://northernmost.org/blog//improving-glusterfs-performance/improving-glusterfs-performance I’ve had a closer look at glusterfs in the last few days following the release of version 2.0.1. We often get customers approaching us with web apps dealing with user-generated content which needs to be uploaded. If you have two or more servers in a load balanced environment, you usually have a few options: an NFS/CIFS share on one of them (single point of failure - failover NFS is, well…), a SAN (expensive), MogileFS (good, but alas not application agnostic), periodically rsync/tar | nc files between the nodes (messy, not application agnostic and slow), or storing files in a database (not ideal for a number of reasons). There are a few other approaches and combinations of the above, but none is perfect. GlusterFS solves this. It’s fast, instant and redundant!

I’ve got four machines set up, two acting as redundant servers. Since they’re effectively acting as a RAID 1, each write is done twice over the wire, but that’s kind of inevitable. They’re all connected in a private isolated gigabit network. When dealing with larger files (a la cp yourfavouritedistro.iso /mnt/gluster) the throughput is really good at around 20-25 MB/s leaving the client. CPU usage on the client doing the copy was in the realms of 20-25% on a dual core. Very good so far! 

Then I tried many frequent filesystem operations, untarring the 2.6.9 Linux kernel from and onto the mount. Not so brilliant! It took 23-24 minutes from start to finish. The 2.6.9 kernel contains 17477 files and the average size is just a few kilobytes. This obviously means a lot of smaller bursts of network traffic!

After seeing this, I dove into the source code to have a look. When I reached the socket code, I realised that the performance for smaller files would probably improve by a lot if Nagle’s algorithm was disabled on the socket. Said and done, I added a few setsockopt()s and went to test. The kernel tree now extracted in 1m 20s!

Of course there’s always a drawback… In this case it is that larger files take longer to transfer, as the raw throughput decreases (a kernel buffer is a lot faster than a cat5!). Copying a 620 MB ISO from local disk onto the mount takes 1m 20s with the vanilla version of GlusterFS, and 3m 34s with Nagle’s algorithm disabled.

I’m not seeing any performance hit on sustained transfer of larger files, but at the moment I’m guessing I’m hitting another bottleneck before that becomes a problem, as it “in theory” should have a slight negative impact in this case.

If you want to have a look at it, you can find the patch here. Just download it to the source directory, do patch -p1 < glusterfs-2.0.1-patch-erik.diff and then proceed to build as normal.

Until I’ve done some more testing on it and received some feedback, I won’t bother making it a tuneable in the vol-file just in case it’d be wasted effort!

]]>
<![CDATA[Don't fix, work around - MySQL]]> 2008-10-26T19:06:06+00:00 http://northernmost.org/blog//dont-fix-work-around-mysql/dont-fix-work-around-mysql I attended the MySQL EMEA conference last Thursday, where I enjoyed a talk from Ivan Zoratti titled “Scaling Up, Scaling Out, Virtualization – What should you do with MySQL?”

They have changed their minds quite a bit. Virtualisation in production is no longer a solid no-no according to them (a lot of people would argue). Solaris containers, anyone?

As most of us know by now, MySQL struggles to utilise multiple cores efficiently. This has been the case for quite some time now, and people like Google and Percona have grown tired of waiting for MySQL to fix it.

Sun decided not to go down the route of reviewing and accepting the patches, but are now suggesting (are you sitting down?) running multiple instances on the same hardware. I’m not against this from a technical point of view, as it currently does improve performance on multiple-core, multiple-disk systems (for an unpatched version) for some workloads, but the fact that they have gone so far as to openly and officially suggest workarounds for their own problem, rather than fixing the source of it, is disturbing.

Granted, I suppose it makes sense to suggest larger boxes if you’ve been bought by a big-iron manufacturer. To be fair, I should also note that Ivan at least didn’t say scaling out was a negative thing, and that it’s still a good option.

If anyone asks me though, I think I’ll keep scaling outwards and use the more sensible version of MySQL.

]]>