<![CDATA[All things sysadmin]]> 2015-10-30T18:41:21+00:00 http://northernmost.org/blog// Octopress <![CDATA[sar rebooting ubuntu]]> 2015-10-30T17:27:45+00:00 http://northernmost.org/blog//sar-rebooting-ubuntu/sar-rebooting-ubuntu Today I had a colleague approach me about a one-liner I sent him many months ago, saying that it kept rebooting a server he was running it on.

It was little more than running sar in a loop, extracting some values and running another command if certain thresholds were exceeded. Hardly anything you’d think would result in a reboot.
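The one-liner itself is beside the point, but for illustration it was something along these lines (a sketch; the threshold and the action taken are made up):

#!/bin/bash
# Hypothetical reconstruction - not the actual one-liner.
# Sample disk I/O with sar and react if the average tps gets too high.
while true; do
    # 'sar -b 1 5': five one-second samples; $2 of the Average line is tps
    tps=$(sar -b 1 5 | awk '/^Average/ { print int($2) }')
    if [ "${tps:-0}" -gt 1000 ]; then
        logger "tps is ${tps}, doing something about it"
    fi
done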

After whittling down the oneliner to the offending command, it turned out that sar was the culprit. Some further debugging revealed that sar merely spawns a process called sadc, which does the actual heavy lifting.

In certain circumstances, if you send SIGINT (ctrl+c, for example) to sar, it can exit before sadc has done its thing.
When that happens, sadc becomes an orphan, and /sbin/init, being a good little init system, takes it under its wing and becomes its parent process.

When sadc receives the SIGINT signal, its signal handler will pass it up to its parent process… You see where this is going, right?
Yep, /sbin/init gets the signal and does what it’s supposed to do when sent SIGINT: initiate a reboot.

If you want to reboot an Ubuntu 14.x server, simply run this in a terminal (as root, this is NOT a DoS/vulnerability, merely a bug):

root@elcrashtest:~# echo $(sar -b 1 5)
^C
root@elcrashtest:~# ^C
root@elcrashtest:~#
Broadcast message from root@elcrashtest
    (unknown) at 18:06 ...

    The system is going down for reboot NOW!
    Control-Alt-Delete pressed

Rapidly hitting ctrl+c twice does the trick.
Obviously this command doesn’t make sense to run in isolation, but the bug was hit in the context of a more involved one-liner, and being in a subprocess seems to trigger it more often. You may need to run it a couple of times, as a few things need to line up for it to happen. The above command reboots the server roughly 8-9 times out of 10.

If executed in another subshell, you only need to hit ctrl+c once to trigger it.

A more unrealistic, but sure-fire way to trigger it looks like this:

root@elcrashtest:~# sar -b 1 100 > /dev/null &
[1] 3777
root@elcrashtest:~# kill -SIGKILL $! ; kill -SIGINT $(pidof sadc);
Broadcast message from root@elcrashtest
...

Basically, this kills sar forcefully (thus orphaning sadc) and then sends SIGINT to sadc. It has a 100% success rate.

This was fixed upstream in 2014, but Canonical has neglected to backport it.
A colleague of mine, who is a much better OSS citizen than I am, has raised this with Canonical.

I only tested this on Ubuntu 14.04 and 14.10. Debian and RedHat/CentOS do not appear to suffer from this. It’s surprising that it’s still present in Ubuntu Trusty, since the fix is backported in Debian Jessie.

Only on a Friday afternoon…

]]>
<![CDATA[GRE tunnels and UFW]]> 2015-09-14T19:17:10+01:00 http://northernmost.org/blog//gre-tunnels-and-ufw/gre-tunnels-and-ufw Today I wrote an Ansible playbook to set up an environment for a docker demo I will be giving shortly. In the demo I will be using three hosts, and I want the containers to be able to speak to each other across hosts. To this end, I’m using Open vSwitch. The setup is quite straightforward: set up the bridge, get the meshed GRE tunnels up, and off you go.
I first set this up in a lab, with firewalls disabled. But knowing that I will give the demo on public infrastructure, I still wrote the play to allow everything on a particular interface (an isolated cloud-network) through UFW.
When I ran my playbook against a few cloud servers, the containers couldn’t talk to each other on account of the GRE tunnels not working.

So I enabled logging in UFW, and soon started seeing entries like this:

[UFW BLOCK] IN=eth2 OUT= MAC=<redacted>
SRC=<redacted> DST=<redacted> LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=36639 DF
PROTO=47

Upon checking which rule actually dropped the packets (iptables -L -nv), it transpired that the culprit was:

1    97 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0
ctstate INVALID

It turns out that a change in kernel 3.18 and onwards means that unless either the nf_conntrack_pptp or the nf_conntrack_proto_gre module is loaded, any GRE packets will be marked as INVALID, as opposed to NEW and subsequently ESTABLISHED.

So in order to get Open vSwitch working with UFW, there are two solutions: either explicitly allow protocol 47 (GRE), or load one of the aforementioned kernel modules.

Should you go for the former solution, this is the rule you need to beat to the punch:

$ grep -A 2 "drop INVALID" /etc/ufw/before.rules
# drop INVALID packets (logs these in loglevel medium and higher)
-A ufw-before-input -m conntrack --ctstate INVALID -j ufw-logging-deny
-A ufw-before-input -m conntrack --ctstate INVALID -j DROP

You beat it to the punch by placing -A ufw-before-input -p 47 -i $iface -j ACCEPT above those lines in /etc/ufw/before.rules.
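Should you prefer the latter solution, it’s a one-liner per module (shown for nf_conntrack_proto_gre; persisting it this way assumes the Debian/Ubuntu /etc/modules convention):

# Track GRE flows so conntrack marks them NEW/ESTABLISHED rather than INVALID
modprobe nf_conntrack_proto_gre
# Make it survive a reboot
echo nf_conntrack_proto_gre >> /etc/modules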

]]>
<![CDATA[LVM thinpool for docker storage on Fedora 22]]> 2015-09-08T10:31:58+01:00 http://northernmost.org/blog//lvm-thinpool-for-docker-storage-on-fedora-22/lvm-thinpool-for-docker-storage-on-fedora-22 TL;DR: You can use docker-storage-setup without the root fs being on LVM by passing the DEVS and VG environment variables to the script, or by editing /etc/sysconfig/docker-storage-setup

I stumbled across this article the other day: ‘Friends Don’t Let Friends Run Docker on Loopback in Production’.

I also saw this bug being raised, saying docker-storage-setup doesn’t work with the Fedora 22 cloud image, as the root fs isn’t on LVM.

I decided to try this out, so I created some block storage and a Fedora 22 VM on the Rackspace cloud:

$ cinder create --display-name docker-storage --volume-type 1fd376b5-c84e-43c5-a66b-d895cb75ac2c 75
# Verify that it's built and is available
$ cinder show 359b01b7-541c-4f4d-b2e7-279d778079a4
# Build a Fedora 22 server with the volume attached
nova boot --image 2cc5db1b-2fc8-42ae-8afb-d30c68037f02 \
--flavor performance1-1 \
--block-device-mapping xvdb=359b01b7-541c-4f4d-b2e7-279d778079a4 \
docker-storage-test

Once on the machine, I followed the article above:

$ dnf -y install docker
$ systemctl stop docker
$ rm -rf /var/lib/docker/

And here’s where the bug report I linked earlier comes into play. docker-storage-setup is just a bash script, and if you go by this output:

docker-storage-setup --help
Usage: /usr/bin/docker-storage-setup [OPTIONS]

Grows the root filesystem and sets up storage for docker.

Options:
  -h, --help            Print help message.

It sure gives the impression of doing only one single thing - growing the root FS! As the bug rightly points out, the Fedora cloud image doesn’t come with LVM for the root FS (which is a good thing!), so there’s no VG for this script to grow.

So unless you read the script, or the manpage, you wouldn’t necessarily notice that what --help says is just the default behaviour, and that you can use docker-storage-setup to use an ephemeral disk and leave the root fs alone. The kicker lies in two environment variables (as opposed to arguments to the script itself, as is more common): $DEVS and $VG. If you supply both of those, and the disk you give in DEVS has no partition table and the VG you supply doesn’t exist, the script will partition the disk and create all the necessary bits for LVM on that disk:

# Verify that ephemeral disk has no partition table:
$ partx -s /dev/xvdb
partx: /dev/xvdb: failed to read partition table

# Start lvmetad
$ systemctl start lvm2-lvmetad

$ DEVS="/dev/xvdb" VG="docker-data" docker-storage-setup
  Volume group "xvda1" not found
  Cannot process volume group xvda1
Checking that no-one is using this disk right now ... OK

Disk /dev/xvdb: 75 GiB, 80530636800 bytes, 157286400 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0x2b7ebb69.
Created a new partition 1 of type 'Linux LVM' and of size 75 GiB.
/dev/xvdb2:
New situation:

Device     Boot Start       End   Sectors Size Id Type
/dev/xvdb1       2048 157286399 157284352  75G 8e Linux LVM

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
  Physical volume "/dev/xvdb1" successfully created
  Volume group "docker-data" successfully created
  Rounding up size to full physical extent 80.00 MiB
  Logical volume "docker-poolmeta" created.
  Logical volume "docker-pool" created.
  WARNING: Converting logical volume docker-data/docker-pool and docker-data/docker-poolmeta to pool's data and metadata volumes.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted docker-data/docker-pool to thin pool.
  Logical volume "docker-pool" changed.

# Verify that the script wrote the docker-storage file
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS=--storage-driver devicemapper --storage-opt dm.fs=xfs
--storage-opt dm.thinpooldev=/dev/mapper/docker--data-docker--pool
--storage-opt dm.use_deferred_removal=true

# Verify that the LV is there:
$ lvs
  LV          VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker-data twi-a-t--- 44.95g             0.00   0.07

So now the script has created the LV thinpool, and written the required docker configuration.
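Incidentally, if you’d rather not pass environment variables on the command line, the same values can be persisted in the file the script sources on startup (a sketch mirroring the run above):

# /etc/sysconfig/docker-storage-setup
DEVS=/dev/xvdb
VG=docker-data

Anyway - with the storage in place, time to start docker and see what it makes of it: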

$ systemctl start docker
$ docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker--data-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: extfs
 Data file:
 Metadata file:
 Data Space Used: 19.92 MB
 Data Space Total: 48.26 GB
 Data Space Available: 48.24 GB
 Metadata Space Used: 65.54 kB
 Metadata Space Total: 83.89 MB
 Metadata Space Available: 83.82 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.0.8-300.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 987.8 MiB
Name: docker-storage-test
ID: EYKV:Q5D6:4F3Y:Z5X3:ZILX:ZBVI:2YF6:VHD7:RFQS:IWWO:MOFL:EWO7

No trace of /dev/loop0! And to verify that it’s actually using our thinpool:

$ lvdisplay | egrep "Allocated pool data" ; du -sh /var/lib/docker/ ; docker pull centos:6 ; du -sh /var/lib/docker ; lvdisplay | egrep "Allocated pool data"
  Allocated pool data    0.04%
5.6M    total
6: Pulling from docker.io/centos
47d44cb6f252: Pull complete
6a7b54515901: Pull complete
e788880c8cfa: Pull complete
1debf8fb53e6: Pull complete
72703a0520b7: Already exists
docker.io/centos:6: The image you are pulling has been verified. Important: image verification is a tech preview feature and should not be relied on to provide security.
Digest: sha256:5436a8b20d6cdf638d936ce1486e277294f6a1360a7b630b9ef76b30d9a88aec
Status: Downloaded newer image for docker.io/centos:6
5.8M    total
  Allocated pool data    0.53%

In conclusion - the script could definitely do with being updated to use command line arguments for this rather than environment variables, and with the --help output updated to highlight this.

]]>
<![CDATA[readdir and directories on xfs]]> 2015-01-16T13:34:46+00:00 http://northernmost.org/blog//readdir-and-directories-on-xfs/readdir-and-directories-on-xfs Recently I had some pretty unexpected results from a piece of code I wrote quite a while ago, and never had any issues with. I ran my program on a brand new CentOS 7 installation, and the results weren’t at all what I was used to!

Consider the following code (abridged and simplified):

readdir_xfs.c
#include <stdio.h>
#include <string.h>    /* strncmp */
#include <limits.h>    /* PATH_MAX */
#include <dirent.h>
#include <sys/types.h>

void recursive_dir(const char *path){

  DIR *dir;
  struct dirent *de;

  if (!(dir = opendir(path))){
    perror("opendir");
    return;
  }
  if (!(de = readdir(dir))){
    perror("readdir");
    return;
  }

  do {

    if (strncmp (de->d_name, ".", 1) == 0 || strncmp (de->d_name, "..", 2) == 0) {
      continue;
    }

    if (de->d_type == DT_DIR){
      char full_path[PATH_MAX];
      snprintf(full_path, PATH_MAX, "%s/%s", path, de->d_name);
      printf("Dir: %s\n", full_path);
      recursive_dir(full_path);
    }
    else {
      printf("\tFile: %s%s\n", path, de->d_name);
    }
  } while ((de = readdir(dir)) != NULL);
  closedir(dir);

}

int main(int argc, char *argv[]){

  if (argc < 2){
    fprintf(stderr, "Usage: %s <dir>\n", argv[0]);
    return 1;
  }

  recursive_dir(argv[1]);
  return 0;
}

Pretty straightforward - it reads directories, printing them and the files within them. Now here’s the kicker:

$ gcc -g dirtraverse.c -o dirtraverse && ./dirtraverse /data_ext4/
Dir: /data_ext4//dir1
        File: /data_ext4//dir1file3
        File: /data_ext4//dir1file1
        File: /data_ext4//dir1file2
Dir: /data_ext4//dir2
        File: /data_ext4//dir2file1
Dir: /data_ext4//dir3
$ rsync -a --delete /data_ext4/ /data_xfs/  # Ensure directories are identical
$ gcc -g dirtraverse.c -o dirtraverse && ./dirtraverse /data_xfs/
        File: /data_xfs/dir1
        File: /data_xfs/dir2
        File: /data_xfs/dir3

No traversal?

After a bit of head scratching, and a few debug statements, I found that when using readdir(3) on XFS, dirent->d_type is always 0 (DT_UNKNOWN), no matter what type of file it is. This means that the de->d_type == DT_DIR check can never be true.

To be fair though, the manpage states that POSIX only mandates dirent->d_name.

So to be absolutely sure your directory traversal code is portable, make use of stat(2) and the S_ISDIR() macro whenever d_type comes back as DT_UNKNOWN!
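(The same principle applies at the shell level, incidentally: test -d asks stat(2) rather than trusting directory entries, so a quick loop like this behaves identically on ext4 and XFS:)

# [ -d ] stats each entry instead of reading d_type, so it can't be fooled
for entry in /data_xfs/*; do
    if [ -d "$entry" ]; then
        echo "Dir: $entry"
    else
        echo "File: $entry"
    fi
done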

]]>
<![CDATA[How does MySQL hide the command line password in ps?]]> 2012-03-10T05:03:46+00:00 http://northernmost.org/blog//how-does-mysql-hide-the-command-line-password-in-ps/how-does-mysql-hide-the-command-line-password-in-ps I saw this question asked today, and thought I’d write a quick post about it. Giving passwords on the command line isn’t necessarily a fantastic idea - but you can sort of see where they’re coming from. Configuration files and environment variables are better, but just slightly. Security is a nightmare!

But if you do decide to write an application which takes a password (or any other sensitive information) on the command line, you can prevent other users on the system from easily seeing it like this:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>

int main(int argc, char *argv[]){

    int i = 0;
    pid_t mypid = getpid();
    if (argc == 1)
        return 1;
    printf("argc = %d and arguments are:\n", argc);
    for (i = 0; i < argc; i++)
        printf("%d = %s\n", i, argv[i]);
    printf("Replacing first argument with x:es... Now open another terminal and run: ps p %d\n", (int)mypid);
    fflush(stdout);
    memset(argv[1], 'x', strlen(argv[1]));
    getc(stdin);
    return 0;
}

A sample run looks like this:

$ ./pwhide abcd
argc = 2 and arguments are:
0 = ./pwhide
1 = abcd
Replacing first argument with x:es... Now run: ps p 27913

In another terminal:

$ ps p 27913
  PID TTY      STAT   TIME COMMAND
27913 pts/1    S+     0:00 ./pwhide xxxx

In the interest of brevity, the above code isn’t very portable - but it works on Linux, and hopefully the point of it comes across. In other environments, such as FreeBSD, you have the setproctitle() library function to do the dirty work for you. The key thing here is the overwriting of argv[1]. Because the size of argv[] is allocated when the program starts, you can’t easily obfuscate the length of the password. I say easily - because of course there is a way.
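Incidentally, on Linux ps reads the argument vector straight out of /proc, so you can see exactly what it sees (using the PID from the sample run above):

# /proc/<pid>/cmdline is the live argv, NUL-separated
$ tr '\0' ' ' < /proc/27913/cmdline
./pwhide xxxx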

]]>
<![CDATA[Font rendering - no more jealousy]]> 2012-02-28T20:02:17+00:00 http://northernmost.org/blog//font-rendering-no-more-jealousy/font-rendering-no-more-jealousy I suppose this kind of content is what most people use twitter for these days. But since I’ve remained strong and stayed well away from that, I suppose I will have to be a tad retro and write a short blog post about it. If you, like me, are an avid Fedora user, I’m sure you’ve thrown glances at colleagues’ or friends’ Ubuntu machines and thought that there was something slightly different about the way they looked (aside from the obvious Gnome vs Unity differences). Shinier, somehow… So had I, but I mainly dismissed it as a case of “the grass is always greener…”.

It turns out that the grass actually IS greener.

Tonight I stumbled upon this. It’s a patched version of freetype. For what I assume are political reasons (free as in speech), Fedora ships a Freetype version without subpixel rendering. These patches fix that and other things.

With a default configuration file of 407 lines, it’s quite extensible and configurable as well. Luckily, I quite like the default!

If you’re not entirely happy with the way your fonts look on Fedora, it’s well worth a look.

]]>
<![CDATA[Transactions and code testing]]> 2011-08-18T13:08:29+01:00 http://northernmost.org/blog//transactions-and-code-testing/transactions-and-code-testing A little while ago I worked with a customer to migrate their DB from MyISAM to InnoDB (something I definitely don’t mind doing!). As part of the testing, I set up a smaller test instance with all tables using the InnoDB engine. I instructed them to thoroughly test their application against this test instance and let me know if they identified any issues.

They reported back that everything seemed fine, and we went off to do the actual migration. Everything went according to plan and things seemed well. After a while they started seeing some discrepancies in the stock portion of their application. The data didn’t add up with what they expected and stock levels seemed surprisingly high. A crontabbed program was responsible for periodically updating the stock count of products, so this was of course the first place I looked. I ran it manually and looked at its output; it was very verbose and reported some 2000 products had been updated. But looking at the actual DB, this was far from the case.

Still having the test environment available, I ran it a few times against that, and could see the com_update and com_insert counters being incremented, so I knew the queries were making it there. But the data remained intact. At this point I had a gut feeling about what was going on, so to confirm it I enabled query logging. It didn’t take me long to spot the problem; on the second line of the log, I saw this:

       40 Query set autocommit=0

The program responsible for updating the stock levels was a Python script using MySQLdb. I couldn’t see any trace of autocommit being set explicitly, so I went on assuming that it was off by default (which turned out to be correct). After adding a connection.commit() once the relevant queries had been sent to the server, everything was back to normal as far as stock levels were concerned. Since the code itself was seeing its own transaction, calls such as cursor.rowcount, which the testers had relied on, were all correct.

But the lesson here: when testing your software from a database point of view, don’t blindly trust what your code tells you it’s done; make sure it has actually done it by verifying the data! A lot of things can happen to data between your program and the platters. Its transaction can deadlock and be rolled back, it can be reading cached data, it can get lost in a crashing message queue, etc.

As a rule of thumb, I’m rather against setting a blanket autocommit=1 in code; I’ve seen that come back to haunt developers in the past. I’m a strong advocate of explicit transaction handling.
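At the SQL level, explicit handling needn’t be anything more involved than this (a sketch; the database, table and column names are made up):

# Explicit transaction via the mysql CLI; nothing is visible to other
# sessions (or durable) until the COMMIT goes through
mysql stockdb <<'SQL'
SET autocommit=0;
START TRANSACTION;
UPDATE products SET stock_count = stock_count - 1 WHERE product_id = 42;
COMMIT;
SQL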

]]>
<![CDATA[Find out what is using your swap]]> 2011-05-27T16:46:40+01:00 http://northernmost.org/blog//find-out-what-is-using-your-swap/find-out-what-is-using-your-swap Have you ever logged in to a server, run free, seen that a bit of swap is used, and wondered what’s in there? Knowing what’s in there usually isn’t very indicative of anything, or even overly helpful; mostly it’s a curiosity thing.

Either way, starting from kernel 2.6.16, we can find out using smaps, which can be found in the proc filesystem. I’ve written a simple bash script which prints out all running processes and their swap usage. It’s quick and dirty, but it does the job and can easily be modified to work on any info exposed in /proc/$PID/smaps. If I find the time and inspiration, I might tidy it up and extend it a bit to cover some more alternatives. The output is in kilobytes.
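Each mapping in smaps carries its own Swap: figure, which is what the script below tallies up per process (sample lines from a random process; your numbers will differ):

$ grep Swap /proc/self/smaps | head -3
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB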

#!/bin/bash
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
SUM=0
OVERALL=0
for DIR in `find /proc/ -maxdepth 1 -type d | egrep "^/proc/[0-9]"` ; do
        PID=`echo $DIR | cut -d / -f 3`
        PROGNAME=`ps -p $PID -o comm --no-headers`
        for SWAP in `grep Swap $DIR/smaps 2>/dev/null| awk '{ print $2 }'`
        do
                let SUM=$SUM+$SWAP
        done
        echo "PID=$PID - Swap used: $SUM - ($PROGNAME )"
        let OVERALL=$OVERALL+$SUM
        SUM=0

done
echo "Overall swap used: $OVERALL"

This will need to be run as root for it to gather accurate numbers. It will still work if you don’t, but it will report 0 for any processes not owned by your user. Needless to say, it’s Linux only. The output is ordered alphabetically according to your locale (which admittedly isn’t a great thing since we’re dealing with numbers), but you can easily apply your standard shell magic to the output. For instance, to find the process with the most swap used, just run the script like so:

$ ./getswap.sh | sort -n -k 5

Don’t want to see stuff that’s not using swap at all?

$ ./getswap.sh  | egrep -v "Swap used: 0" |sort -n -k 5

… and so on and so forth.

]]>
<![CDATA[Example using Cassandra with Thrift in C++]]> 2011-05-21T20:09:46+01:00 http://northernmost.org/blog//example-using-cassandra-with-thrift-in-c-plus-plus/example-using-cassandra-with-thrift-in-c-plus-plus Due to a very exciting, recently launched project at work, I’ve had to interface with Cassandra through C++ code. As anyone who has done this can testify, the API docs are vague at best, and there are very few examples out there. The constant API changes between 0.x versions don’t help either, nor does the fact that the Cassandra API has its docs and Thrift has its own, with nothing bridging the two. So at the moment it is very much a case of dissecting header files and looking at the implementations in the Thrift-generated source files.

The only somewhat useful example of using Cassandra with C++ one can find online is this, but due to the API changes, this is now outdated (it’s still worth a read).

So in the hope that nobody else will have to spend the better part of a day piecing things together to achieve even the most basic thing, here’s an example which works with Cassandra 0.7 and Thrift 0.6.

First of all, create a new keyspace and a column family, using cassandra-cli:

[default@unknown] create keyspace nm_example;
c647b2c0-83e2-11e0-9eb2-e700f669bcfc
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown] use nm_example;
Authenticated to keyspace: nm_example
[default@nm_example] create column family nm_cfamily with comparator=BytesType and default_validation_class=BytesType;
30466721-83e3-11e0-9eb2-e700f669bcfc
Waiting for schema agreement...
... schemas agree across the cluster
[default@nm_example]

Now go to the directory where you have Cassandra installed, enter the interface/ directory and run: thrift -gen cpp cassandra.thrift. This will create the gen-cpp/ directory. From this directory, you need to copy all files bar Cassandra_server.skeleton.cpp to wherever you intend to keep your sources. Here’s some example code which inserts, retrieves, updates, retrieves and deletes keys:

#include "Cassandra.h"

#include <protocol/TBinaryProtocol.h>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TTransportUtils.h>

using namespace std;
using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace org::apache::cassandra;
using namespace boost;

static string host("127.0.0.1");
static int port= 9160;

int64_t getTS(){
    /* If you're doing things quickly, you may want to make use of tv_usec
     * or something here instead
     */
    time_t ltime;
    ltime=time(NULL);
    return (int64_t)ltime;

}

int main(){
    shared_ptr<TTransport> socket(new TSocket(host, port));
    shared_ptr<TTransport> transport(new TFramedTransport(socket));
    shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
    CassandraClient client(protocol);

    const string& key="your_key";

    ColumnPath cpath;
    ColumnParent cp;

    ColumnOrSuperColumn csc;
    Column c;

    c.name.assign("column_name");
    c.value.assign("Data for our key to go into column_name");
    c.timestamp = getTS();
    c.ttl = 300;

    cp.column_family.assign("nm_cfamily");
    cp.super_column.assign("");

    cpath.column_family.assign("nm_cfamily");
    /* This is required - thrift 'feature' */
    cpath.__isset.column = true;
    cpath.column="column_name";

    try {
        transport->open();
        cout << "Set keyspace to 'dpdns'.." << endl;
        client.set_keyspace("nm_example");

        cout << "Insert key '" << key << "' in column '" << c.name << "' in column family '" << cp.column_family << "' with timestamp " << c.timestamp << "..." << endl;
        client.insert(key, cp, c, org::apache::cassandra::ConsistencyLevel::ONE);

        cout << "Retrieve key '" << key << "' from column '" << cpath.column << "' in column family '" << cpath.column_family << "' again..." << endl;
        client.get(csc, key, cpath, org::apache::cassandra::ConsistencyLevel::ONE);
        cout << "Value read is '" << csc.column.value << "'..." << endl;

        c.timestamp++;
        c.value.assign("Updated data going into column_name");
        cout << "Update key '" << key << "' in column with timestamp " << c.timestamp << "..." << endl;
        client.insert(key, cp, c, org::apache::cassandra::ConsistencyLevel::ONE);

        cout << "Retrieve updated key '" << key << "' from column '" << cpath.column << "' in column family '" << cpath.column_family << "' again..." << endl;
        client.get(csc, key, cpath, org::apache::cassandra::ConsistencyLevel::ONE);
        cout << "Updated value is: '" << csc.column.value << "'" << endl;

        cout << "Remove the key '" << key << "' we just retrieved. Value '" << csc.column.value << "' timestamp " << csc.column.timestamp << " ..." << endl;
        client.remove(key, cpath, csc.column.timestamp, org::apache::cassandra::ConsistencyLevel::ONE);

        transport->close();
    }
    catch (NotFoundException &nf){
        cerr << "NotFoundException ERROR: "<< nf.what() << endl;
    }
    catch (InvalidRequestException &re) {
        cerr << "InvalidRequest ERROR: " << re.why << endl;
    }
    catch (TException &tx) {
        cerr << "TException ERROR: " << tx.what() << endl;
    }

    return 0;
}

Say we’ve called the file cassandra_example.cpp and you have the files mentioned above in the same directory; you can then compile things like this:

$ g++ -lthrift -Wall  cassandra_example.cpp cassandra_constants.cpp Cassandra.cpp cassandra_types.cpp -o cassandra_example
$ ./cassandra_example
Set keyspace to 'nm_example'..
Insert key 'your_key' in column 'column_name' in column family 'nm_cfamily' with timestamp 1306008338...
Retrieve key 'your_key' from column 'column_name' in column family 'nm_cfamily' again...
Value read is 'Data for our key to go into column_name'...
Update key 'your_key' in column with timestamp 1306008339...
Retrieve updated key 'your_key' from column 'column_name' in column family 'nm_cfamily' again...
Updated value is: 'Updated data going into column_name'
Remove the key 'your_key' we just retrieved. Value 'Updated data going into column_name' timestamp 1306008339 ...

Another thing worth mentioning is Padraig O'Sullivan’s libcassandra, which may or may not be worth a look depending on what you want to do and what versions of Thrift and Cassandra you’re tied to.

]]>
<![CDATA[Site slow after scaling out? Yeah, possibly!]]> 2011-03-29T06:03:46+01:00 http://northernmost.org/blog//site-slow-after-scaling-out-yeah-possibly/site-slow-after-scaling-out-yeah-possibly Every now and then, we have customers who outgrow their single server setup. The next natural step is of course splitting the web layer from the DB layer. So they get another server, and move the database to that.

So far so good! A week or so later, we often get the call “Our page load time is higher now than before the upgrade! We’ve got twice as much hardware, and it’s slower! You have broken it!” It’s easy to see where they’re coming from. It makes sense, right?

That is until you factor in the newly introduced network topology! Today it’s not unusual (that’s not to say it’s acceptable or optimal) for your average wordpress/drupal/joomla/otherspawnofsatan site to run 40-50 queries per page load. Quite often even more!

Based on a tcpdump session of a reasonably average query (if there is such a thing), connecting to a server, authenticating, sending a query and receiving a 5-row result set of 1434 bytes yields 25 packets sent between my laptop and a remote DB server on the same wired, non-congested network. A normal, average latency for TCP/IP over Ethernet is ~0.2 ms for packets of the size we’re talking about here. So, doing the maths, you’re looking at 25 packets × 0.2 ms × 50 queries = 250 ms of pure network latency per page load for your SQL queries. This is obviously a lot more than you’d see over a local UNIX socket.

This is inevitable - laws of physics. There is nothing you, your sysadmin and/or your hosting company can do about it. There may however be something your developer can do about the amount of queries! You also shouldn’t confuse response times with availability. Your response times may be slower, but you can (hopefully) serve a lot more users with this setup!

Sure, there are technologies out there which have considerably less latency than ethernet, but they come with quite the price-tag, and there are more often than not quite a few avenues to go down before it makes sense to start looking at that kind of thing.

You could also potentially look at running the full stack on both machines, using master/master replication for your DBs, load balancing your front-ends, and having both read locally but only write to one node at a time! That kind of DB scenario is fairly easily set up using mmm for MySQL. But in my experience, this often ends up more costly and introduces more complexity than it solves. I’m an avid advocate for keeping server roles separate as much as possible!

]]>
<![CDATA[A look at mysql-5-5 semi-synchronous replication]]> 2010-10-09T20:19:54+01:00 http://northernmost.org/blog//a-look-at-mysql-5-5-semi-synchronous-replication/a-look-at-mysql-5-5-semi-synchronous-replication Now that MySQL 5.5 is in RC, I decided to have a look at the semi-synchronous replication. It’s easy to get going, and from my very initial tests it appears to work a treat.

This mode of replication is called semi-synchronous due to the fact that it only guarantees that at least one of the slaves has written the transaction to disk in its relay log, not actually committed it to its data files. It guarantees that the data exists by some means somewhere, but not that it’s retrievable through a MySQL client.

Semi sync is available as a plugin, and if you compile from source, you’ll need to do --with-plugins=semisync. So far, the semisync plugin can only be built as a dynamic module, so you’ll need to install it once you’ve got your instance up and running. To do this, you do as with any other plugin:

install plugin rpl_semi_sync_master soname 'semisync_master.so';
install plugin rpl_semi_sync_slave soname 'semisync_slave.so';

You might get a 1126 error and a message saying “Can’t open shared library..”, in which case you most likely need to set the plugin_dir variable in my.cnf and give MySQL a restart. If you’re using a master/slave pair, you obviously won’t need to load both modules as above; you load the slave one on your slave, and the master one on your master. Once you’ve done this, you’ll have entries for these modules in the mysql.plugin table. When you have confirmed that you do, you can safely add the pertinent variables to your my.cnf. The values I used (in addition to the normal replication settings) for my master/master sandboxes were:

plugin_dir=/opt/mysql-5.5.6-rc/lib/mysql/plugin/
rpl_semi_sync_master_enabled=1
rpl_semi_sync_master_timeout=10000
rpl_semi_sync_slave_enabled=1
rpl_semi_sync_master_trace_level=64
rpl_semi_sync_slave_trace_level=64
rpl_semi_sync_master_wait_no_slave=1

Note that you probably won’t want to use these values for _trace_level in production due to the verbosity in the log! I just enabled these while testing. Also note that the timeout is in milliseconds. You can also set these on the fly with SET GLOBAL (thanks Oracle!), just make sure the slave is stopped before doing this, as it needs to be enabled during the handshake with the master for the semisync to kick in.

The timeout is the amount of time the master will lock and wait for a slave to acknowledge the write before giving up on the whole idea of semi synchronous operation and continue as normal. If you want to monitor this, you can use the status variable Rpl_semi_sync_master_status which is set to Off when this happens. If this condition should be avoided altogether, you would need to set a large enough value for the timeout and a low enough monitoring threshold as there doesn’t seem to be a way to force MySQL to wait forever for a slave to appear.
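A minimal check for a monitoring script might look like this (a sketch):

# 'ON' means at least one semisync slave is acknowledging writes;
# 'OFF' means the master has timed out and fallen back to asynchronous
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status'"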

If you’re running an automated failover setup, you’ll want to set the timeout higher than your heartbeat, thus ensuring no committed data is lost. Then you might also want to set the timeout considerably lower initially on the passive master, so that you don’t end up waiting on the master we know is unhealthy and have just failed away from.

Before implementing this in production, I would strongly recommend running a few performance tests against your setup, as this will slow things down considerably for some workloads. Each transaction has to be written to the binlog, read over the wire and written to the relay log, and then lastly flushed to disk before each DML statement returns. You will almost certainly benefit from batching up queries into larger transactions rather than using the default autocommit mode, as this reduces how often those steps have to happen. Update: Even though the manual clearly states that the event has to be flushed to disk, this doesn’t actually appear to be the case (see comments). The above still stands, but the impact may not be as great as first thought.

When I find the time, I will run some benchmarks on this.

Lastly, please note that this is written while MySQL 5.5 is still in release candidate stage, so while unlikely, things are subject to change. So please be mindful of this in future comments.

]]>
<![CDATA[GlusterFS init script and Puppet]]> 2010-08-09T08:08:14+01:00 http://northernmost.org/blog//glusterfs-init-script-and-puppet/glusterfs-init-script-and-puppet The other day I had quite the head scratcher. I was setting up a new environment for a customer which included the usual suspects in a LAMP stack spread across a few virtual machines in an ESXi cluster. As the project is quite volatile in terms of requirements, amount of servers, server roles, location etc. I decided to start off using Puppet to make my life easier further down the road.

I got most of it set up, and got started on writing up the glusterfs Puppet module. Fairly straightforward: a few directories, configuration files and a mount point. Then I came to the Service declaration, and of course we want this to be running at all times, so I went on and wrote:

service { "glusterfsd":
    ensure => running,
    enable => true,
    hasrestart => true,
    hasstatus => true,
}

expecting glusterfsd to be running shortly after I purposefully stopped it. But it wasn’t. So I dove into Puppet (Yay Ruby!) and deduced that the way it determines whether something is running or not is the return code of: /sbin/service servicename status

So a quick look in the init script which ships with glusterfs-server shows that it calls the stock init function “status” on glusterfsd, which is perfectly fine, but then it doesn’t exit with the return code from this function, it simply runs out of scope and exits with the default value of 0.

So to get around this, I made a quick change to the init script and used the return code from the “status” function (/etc/rc.d/init.d/functions on RHEL5) and exited with $?, and Puppet had glusterfsd running within minutes.
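The change amounts to something along these lines at the end of the status case (a sketch; the layout of the shipped script will differ):

# in /etc/init.d/glusterfsd
status)
        status glusterfsd   # from /etc/rc.d/init.d/functions
        RETVAL=$?
        ;;
esac
exit $RETVAL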

I couldn’t find anything when searching for this, so I thought I’d make a note of it here.

]]>
<![CDATA[Legitimate emails being dropped by Spamassassin in RHEL5]]> 2010-05-26T19:05:34+01:00 http://northernmost.org/blog//legitimate-emails-being-dropped-by-spamassassin-in-rhel5/legitimate-emails-being-dropped-by-spamassassin-in-rhel5 Over the past few months, an increasing number of customers have complained that their otherwise OK spam filters have started dropping an inordinate amount of legitimate emails. The first reaction is of course to increase the score required to be filtered, but that just opens up for more spam. I looked in the quarantine on one of these servers, and ran a few of the legitimate ones through spamassassin in debug mode. I noticed one particular rule which was prevalent in the vast majority of the emails. Here’s an example:

...
[2162] dbg: learn: initializing learner
[2162] dbg: check: is spam? score=4.004 required=6
[2162] dbg: check: tests=FH_DATE_PAST_20XX,HTML_MESSAGE,SPF_HELO_PASS
...

4 is obviously quite a high score for an email whose only flaw is being in HTML. But FH_DATE_PAST_20XX caught my eye in all of the outputs. So to the rule files:

$ grep FH_DATE_PAST_20XX /usr/share/spamassassin/72_active.cf
##{ FH_DATE_PAST_20XX
header   FH_DATE_PAST_20XX      Date =~ /20[1-9][0-9]/ [if-unset: 2006]
describe FH_DATE_PAST_20XX      The date is grossly in the future.
##} FH_DATE_PAST_20XX

Aha. This is a problem - ever since the 1st of January 2010, every legitimate Date: header matches that regex. With 50_scores.cf containing this:

$ grep FH_DATE_PAST /usr/share/spamassassin/50_scores.cf
score FH_DATE_PAST_20XX 2.075 3.384 3.554 3.188 # n=2

it’s no wonder emails are getting dropped! I guess this is a problem one can expect when running a distribution with packages 6 years old and neglecting to frequently (or at least every once in a while) update the rules!
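Until the packages catch up, you can either pull fresh rules or neutralise the offending one locally (a score of 0 disables a rule):

# The real fix - fetch current rules:
$ sa-update && service spamassassin restart
# Or zero out the rule in /etc/mail/spamassassin/local.cf:
score FH_DATE_PAST_20XX 0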

Luckily, this rule is gone altogether from RHEL6’s version of spamassassin.

]]>
<![CDATA[Control-groups in rhel6]]> 2010-05-13T09:26:51+01:00 http://northernmost.org/blog//control-groups-in-rhel6/control-groups-in-rhel6 One new feature that I’m very enthusiastic about in RHEL6 is Control Groups (cgroup for short). It allows you to create groups and allocate resources to these. You can then bunch your applications into groups to your heart’s content.

It’s relatively simple to set up, and configuration can be done in two different ways. You can use the supplied cgset command, or if you’re accustomed to doing it the usual way when dealing with kernel settings, you can simply echo values into the pseudo-files under the control group.

Here’s a control group in action:

[root@rhel6beta cgtest]# grep $$ /cgroup/gen/group1/tasks
1138
[root@rhel6beta cgtest]# cat /cgroup/gen/group1/memory.limit_in_bytes
536870912
[root@rhel6beta cgtest]# gcc alloc.c -o alloc && ./alloc
Allocating 642355200 bytes of RAM,,,
Killed
[root@rhel6beta cgtest]# echo `echo 1024*1024*1024| bc` >
/cgroup/gen/group1/memory.limit_in_bytes
[root@rhel6beta cgtest]# ./alloc
Allocating 642355200 bytes of RAM,,,
Successfully allocated 642355200 bytes of RAM, captn' Erik...
[root@rhel6beta cgtest]#

The first line shows that the shell which launches the app is under the control of the cgroup group1, so all its child processes are subject to the same restrictions.

As you can also see, the initial memory limit in the group is 512M. Alloc is a simple C app I wrote which calloc()s 612M of RAM (for demonstrative purposes, I’ve disabled swap on the system altogether). On the first run, the kernel kills the process in the same way it would if the whole system had run out of memory. The kernel message also indicates that the control group ran out of memory, and not the system as a whole:

...
May 13 17:56:20 rhel6beta kernel: Memory cgroup out of memory: kill process
1710 (alloc) score 9861 or a child
May 13 17:56:20 rhel6beta kernel: Killed process 1710 (alloc)

Unfortunately it doesn’t indicate which cgroup the process belonged to. Maybe it should?

cgroups don’t just give you the ability to limit the amount of RAM; there are a lot of tuneables. You can even set swappiness on a per-group basis! You can limit the devices applications are allowed to access, you can freeze processes, and you can tag outgoing network packets with a class ID in case you want to do shaping or profiling on your network! Perfect if you want to prioritise SSH traffic over anything else, so you can work comfortably even when your uplink is saturated. Furthermore, you can easily get an overview of memory usage, CPU accounting etc. for applications in any given group.

All this means you can clearly separate resources and, to quite a large extent, ensure that applications won’t starve the whole system, or each other, of resources. Very handy - no more waiting half an hour for the swap to fill up and the OOM killer to kick in (often choosing the wrong PID) when customers’ applications have run astray.

A much welcomed addition to RHEL!

]]>
<![CDATA[boot loader not installed in rhel6 beta]]> 2010-04-10T12:01:52+01:00 http://northernmost.org/blog//boot-loader-not-installed-in-rhel6-beta/boot-loader-not-installed-in-rhel6-beta Just a heads up I thought I’d share in the hope that it’ll save someone some time: when installing RHEL6 beta under Xen, be aware that pygrub currently can’t handle /boot being on ext4 (which is the default). So in order to run RHEL6 under Xen, ensure that you modify the partition layout during the installation process.

This turned out to be a real head scratcher for me, and initially I thought the problem was something else as Xen wasn’t being very helpful with error messages.

Hopefully there’ll be an update for this soon!

]]>
<![CDATA[building hiphop-php gotcha]]> 2010-02-21T11:17:51+00:00 http://northernmost.org/blog//building-hiphop-php-gotcha/building-hiphop-php-gotcha Tonight I’ve delved into the world of Facebook’s HipHop for PHP. Let me point out early on that I’m not doing so because I believe I will need it any time soon, but I am convinced that I will, without a shadow of a doubt, be approached by customers who think they do, and I’d rather not have opinions on, or advise against, things I haven’t tried myself or at least have a very good understanding of.

Unfortunately I set about this task on an RHEL 5.4 box, and it hasn’t been a walk in the park. Quite a few dependencies were out of date or didn’t exist in the repositories: libicu, boost, onig, tbb etc.

CMake did a good job of telling me what was wrong though, so it wasn’t a huge deal; I just compiled the missing pieces from source and put them in $CMAKE_PREFIX_PATH. One thing CMake didn’t pick up on, however, was that the flex version shipped with current RHEL is rather outdated. Once I thought I had everything configured, I set about the compilation, and my joy was swiftly cut short by this:

[  3%] [FLEX][XHPScanner] Building scanner with flex /usr/bin/flex version
2.5.4
/usr/bin/flex: unknown flag '-'.  For usage, try /usr/bin/flex --help

Not entirely sure what it was actually doing here, I took the shortcut of replacing /usr/bin/flex with a shell script which just exited after putting $@ in a file in /tmp/, and re-ran make. The shim was nothing fancier than this (a sketch; the exact file name doesn’t matter):
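#!/bin/sh
# Stand-in for /usr/bin/flex: record the arguments, then bail out
echo "$@" >> /tmp/flex-args.txt
exit 0

Looking in the resulting file, these are the arguments flex was given: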

-C --header-file=scanner.lex.hpp
-o/home/erik/dev/hiphop-php/src/third_party/xhp/xhp/scanner.lex.cpp
/home/erik/dev/hiphop-php/src/third_party/xhp/xhp/scanner.l

To me that looks quite valid, and there’s certainly no lone ‘-’ in that command line.

Long story short, flex introduced --header-file in a relatively “recent” version (2.5.33 it seems, but I may be wrong on that one; it doesn’t matter). Unlike most other programs (which use getopt), it won’t tell you Invalid option ‘--header-file’. So after compiling a newer version of flex, I was sailing again.

]]>
<![CDATA[Development; just as important as dual nics]]> 2010-02-13T17:38:56+00:00 http://northernmost.org/blog//development-just-as-important-as-dual-nics/development-just-as-important-as-dual-nics There is a popular saying which I find you can apply to most things in life; “You get what you pay for”. Sadly, this does not seem to apply for software development in any way. You who know me know that I work for a reasonably sized hosting company in the upper market segment. We have thousands of servers and hundreds of customers, so after a while you get a quite decent overview of how things work and a vast arsenal of “stories from the trenches”.

So here’s a small tip; ensure that your developers know what they are doing! It will save you a lot of hassle and money in the long run.

Without having made a science out of it, I can confidently say that at the very least 95% of the downtime I see on a daily basis is due to faulty code in the applications running on our servers.

So after you’ve demanded dual power feeds to your rack, bonded NICs and a gazillion physical paths to your dual controller SAN, it would make sense to apply the same attitude towards your developers. After all, they are carbon based humans and are far more likely to break than your silicon NIC. Now unfortunately it is not as simple as “if I pay someone a lot of money and let them do their thing, I will get good solid code out of it”, so a great deal of due diligence is required in this part of your environment as well. I have seen more plain stupid things coming from 50k pa. people than I care to mention, and I have seen plain brilliant things coming out of college kids’ basements.

This is important not only from an availability point of view, it’s also about running cost. The amount of hardware in our data centers which is completely redundant, and which could easily be made obsolete with a bit of code and database tweaking, is frightening. So you think you may have cut a great deal when someone said they could build your e-commerce system in 3 months for 10k less than other people quoted you. But in actual fact, all you have done is get someone to effectively re-brand a bloated, way too generic, stock framework/product which the developer has very little insight into and control over. Yes, it works if you “click here, there and then that button”; the right thing does appear on the screen. But only after executing hundreds of SQL queries, looking for your session in three different places, doing four HTTP redirects, reading five config files and including 45 other source files. Needless to say, that one-off 10k you think you have saved will be swallowed by recurring hardware cost in no time. You have probably also severely limited your ability to scale things up in the future.

So in summary, don’t cheap out on your development but at the same time don’t think that throwing money at people will make them write good code. Ask someone else to look things over every now and then, even if it will cost you a little bit. Use the budget you were planning on spending on the SEO consultant. Let it take time.

]]>
<![CDATA[GlusterFS tcp_nodelay patch update]]> 2009-06-29T06:06:34+01:00 http://northernmost.org/blog//glusterfs-tcp-nodelay-patch-update/glusterfs-tcp-nodelay-patch-update As mentioned in my previous post, I wrote a patch for GlusterFS to increase its performance when operating on many smaller files. Someone told me the other day that this functionality has been pushed to the git repository. Would have been good to have heard about this sooner…

So all of you who emailed me positive feedback and asked to make it a tuneable in the translator config (thanks!) - please check out the above link to the git repository.

On another note, it seems as if they’re breaking away from having the protocol version bound to the release version, good progress in my opinion!

]]>
<![CDATA[Improving GlusterFS performance]]> 2009-06-04T20:06:08+01:00 http://northernmost.org/blog//improving-glusterfs-performance/improving-glusterfs-performance I’ve had a closer look at glusterfs in the last few days following the release of version 2.0.1. We often get customers approaching us with web apps dealing with user-generated content which needs to be uploaded. If you have two or more servers in a load balanced environment, you usually have a few options: an NFS/CIFS share on one of them (single point of failure - failover NFS is, well…), a SAN (expensive), MogileFS (good, but alas not application agnostic), periodically rsync/tar | nc files between the nodes (messy, not application agnostic and slow), or storing files in a database (not ideal for a number of reasons). There are a few other approaches and combinations of the above, but none is perfect. GlusterFS solves this. It’s fast, instant and redundant!

I’ve got four machines set up, two acting as redundant servers. Since they’re effectively acting as a RAID 1, each write is done twice over the wire, but that’s kind of inevitable. They’re all connected in a private isolated gigabit network. When dealing with larger files (a la cp yourfavouritedistro.iso /mnt/gluster) the throughput is really good at around 20-25 MB/s leaving the client. CPU usage on the client doing the copy was in the realms of 20-25% on a dual core. Very good so far! 

Then I tried many frequent filesystem operations, untarring the 2.6.9 Linux kernel from and onto the mount. Not so brilliant! It took 23-24 minutes from start to finish. The 2.6.9 kernel contains 17477 files and the average size is just a few kilobytes. This obviously means a lot of smaller bursts of network traffic!

After seeing this, I dove into the source code to have a look. When I reached the socket code, I realised that the performance for smaller files would probably improve by a lot if Nagle’s algorithm was disabled on the socket. Said and done, I added a few setsockopt()s and went to test. The kernel tree now extracted in 1m 20s!

Of course there’s always a drawback… In this case it is that larger files take longer to transfer, as the raw throughput decreases (a kernel buffer is a lot faster than a cat5!). Copying a 620 MB ISO from local disk onto the mount takes 1m 20s with the vanilla version of GlusterFS, and 3m 34s with Nagle’s algorithm disabled.

I’m not seeing any performance hit on sustained transfer of larger files, but at the moment I’m guessing I’m hitting another bottleneck before that becomes a problem, as it “in theory” should have a slight negative impact in this case.

If you want to have a look at it, you can find the patch here. Just download it to the source directory, do patch -p1 < glusterfs-2.0.1-patch-erik.diff and then proceed to build as normal.

Until I’ve done some more testing on it and received some feedback, I won’t bother making it a tuneable in the vol-file just in case it’d be wasted effort!

]]>
<![CDATA[Don't fix, work around - MySQL]]> 2008-10-26T19:06:06+00:00 http://northernmost.org/blog//dont-fix-work-around-mysql/dont-fix-work-around-mysql I attended the MySQL EMEA conference last Thursday, where I enjoyed a talk from Ivan Zoratti titled “Scaling Up, Scaling Out, Virtualization – What should you do with MySQL?”

They have changed their minds quite a bit. Virtualisation in production is no longer a solid no-no according to them (a lot of people would argue). Solaris containers, anyone?

As most of us know by now, MySQL struggles to utilise multiple cores efficiently. This has been the case for quite some time now, and people like Google and Percona have grown tired of waiting for MySQL to fix it.

Sun decided not to go down the route of reviewing and accepting the patches, but are now suggesting (are you sitting down?) running multiple instances on the same hardware. I’m not against this from a technical point of view, as it currently does improve performance on multiple-core, multiple-disk systems (for an unpatched version) for some workloads, but the fact that they have gone so far as to openly and officially suggest workarounds for their own problem, rather than fixing the source of it, is disturbing.

Granted, I suppose it makes sense to suggest larger boxes if you’ve been bought by a big-iron manufacturer. To be fair, I should also note that Ivan at least didn’t say scaling out was a negative thing, and that it’s still a good option.

If anyone asks me though, I think I’ll keep scaling outwards and use the more sensible version of MySQL.

]]>