“My web site is slow”
I picked the type of issue for this article at random; the approach can be
applied to pretty much any sysadmin-related troubleshooting.
It’s not about showing off the cleverest oneliners to find the most
information. It’s also not an exhaustive, step-by-step “flowchart” with the
word “profit” in the last box.
It’s about general approach, by means of a few examples.
The example scenarios are solely for illustrative purposes. They sometimes
rest on assumptions that don’t apply to all cases all of the time, and I’m
positive many readers will go “oh, but I think you will find…” at some point.
But that would be missing the point.
Having worked in support, or within a support organization, for over a decade,
one thing strikes me time and time again, and it is what made me write
this;
The instinctive reaction many techs have when facing a problem, is
to start throwing potential solutions at it.
“My website is slow”
MaxClients/MaxRequestWorkers/worker_connections
innodb_buffer_pool_size/effective_cache_size
mod_gzip
(true story, sadly)
“I saw this issue once, and then it was because X. So I’m going to try to fix X again, it might work”.
This wastes a lot of time, and sends you on a wild goose chase. In the dark. Wearing greased mittens.
InnoDB’s buffer pool may well be at 100% utilization, but that’s just because
there are remnants of a large one-off report someone ran a while back in there.
If there are no evictions, you’ve just wasted time.
At this point, I should mention that while it’s equally applicable to many roles, I’m writing this from a general support system administrator’s point of view. In a mature, in-house organization or when working with larger, fully managed or “enterprise” customers, you’ll typically have everything instrumented, measured, graphed, thresheld (not even a word) and alerted on. Then your approach will often be rather different. We’re going in blind here.
If you don’t have that sort of thing at your disposal;
Establish what the issue actually is. “Slow” can take many forms. Is it time to first byte? That’s a whole different class of problem from poor Javascript loading and pulling down 15 MB of static assets on each page load. Is it slow, or just slower than it usually is? Two very different plans of attack!
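One quick way to tell those classes of problem apart is to time just the first byte of the response, separate from the full page load. A minimal sketch (the host, port and path are placeholders; assumes a plain HTTP service):

```python
import socket
import time

def time_to_first_byte(host, port, path="/"):
    """Connect, send a minimal GET, and time how long the first
    byte of the response takes to arrive."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(
            f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        )
        sock.recv(1)  # block until the very first response byte
    return time.perf_counter() - start
```

A high number here points at the server side; a low number combined with a slow page points at asset delivery or the browser.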
Make sure you know what the issue reported/experienced actually is before you
go off and do something. Finding the source of the problem is often difficult
enough, without also having to find the problem itself.
That is the sysadmin equivalent of bringing a knife to a gunfight.
You are allowed to look for a few usual suspects when you first log in to a
suspect server. In fact, you should! I tend to fire off a smattering of commands
whenever I log in to a server to just very quickly check a few things: Are we
swapping (free/vmstat), are the disks busy (top/iostat/iotop), are we dropping
packets (netstat, /proc/net/dev), is there an undue amount of connections in an
undue state (netstat), is something hogging the CPUs (top), is someone else on
this server (w/who), any eye-catching messages in syslog and dmesg?
There’s little point to carrying on if you have 2000 messages from your RAID controller about how unhappy it is with its write-through cache.
This doesn’t have to take more than half a minute. If nothing catches your eye – continue.
If there indeed is a problem somewhere, and there’s no low hanging fruit to be found;
Take all steps you can to try and reproduce the problem. When you can reproduce, you can observe. When you can observe, you can solve. Ask the person reporting the issue what exact steps to take to reproduce the issue if it isn’t already obvious or covered by the first section.
Now, for issues caused by solar flares and clients running exclusively on OS/2, it’s not always feasible to reproduce. But your first port of call should be to at least try! In the very beginning, all you know is “X thinks their website is slow”. For all you know at that point, they could be tethered to their GPRS mobile phone and applying Windows updates. Delving any deeper than we already have at that point is, again, a waste of time.
Attempt to reproduce!
It saddens me that I felt the need to include this. But I’ve seen escalations
that ended mere minutes after someone ran tail /var/log/..
Most *NIX tools these days are pretty good at logging. Anything blatantly wrong
will manifest itself quite prominently in most application logs. Check it.
If there are no obvious issues, but you can reproduce the reported problem,
great.
So, you know the website is slow.
Now you’ve narrowed things down to: Browser rendering/bug, application
code, DNS infrastructure, router, firewall, NICs (all eight+ involved),
ethernet cables, load balancer, database, caching layer, session storage, web
server software, application server, RAM, CPU, RAID card, disks.
Add a smattering of other potential culprits depending on the set-up. It could
be the SAN, too. And don’t forget about the hardware WAF! And.. you get my
point.
If the issue is time-to-first-byte you’ll of course start applying known fixes
to the webserver, that’s the one responding slowly and what you know the most
about, right? Wrong!
You go back to trying to reproduce the issue. Only this time, you try to
eliminate as many potential sources of issues as possible.
You can eliminate the vast majority of potential culprits very
easily:
Can you reproduce the issue locally from the server(s)?
Congratulations, you’ve
just saved yourself having to try your fixes for BGP routing.
If you can’t, try from another machine on the same network.
If you can - at least you can move the firewall down your list of suspects (but do keep
a suspicious eye on that switch!).
Are all connections slow? Just because the server is a web server, doesn’t mean you shouldn’t try to reproduce with another type of service. netcat is very useful in these scenarios (but chances are your SSH connection would have been lagging this whole time, as a clue)! If that’s also slow, you at least know you’ve most likely got a networking problem and can disregard the entire web stack and all its components. Start from the top again with this knowledge (do not collect $200). Work your way from the inside-out!
Even if you can reproduce locally - there’s still a whole lot of “stuff”
left. Let’s remove a few more variables.
Can you reproduce it with a flat file? If i_am_a_1kb_file.html is slow,
you know it’s not your DB, caching layer or anything beyond the OS and the webserver
itself.
Can you reproduce with an interpreted/executed hello_world.(py|php|js|rb..) file?
If you can, you’ve narrowed things down considerably, and you can focus on
just a handful of things.
If hello_world is served instantly, you’ve still learned a lot! You know
there aren’t any blatant resource constraints, any full queues or stuck
IPC calls anywhere. So it’s something the application is doing or
something it’s communicating with.
Are all pages slow? Or just the ones loading the “Live scores feed” from a third party?
What this boils down to is: what’s the smallest amount of “stuff” that you can involve and still reproduce the issue?
Our example is a slow web site, but this is equally applicable to almost any issue. Mail delivery? Can you deliver locally? To yourself? To <common provider here>? Test with small, plaintext messages. Work your way up to the 2MB campaign blast. STARTTLS and no STARTTLS. Work your way from the inside-out.
Each of these steps takes mere seconds, far quicker than implementing most “potential” fixes.
By now, you may already have stumbled across the problem by virtue of being unable to reproduce when you removed a particular component.
But if you haven’t, or you still don’t know why; Once you’ve found a way to reproduce the issue with the smallest amount of “stuff” (technical term) between you and the issue, it’s time to start isolating and observing.
Bear in mind that many services can be run in the foreground, and/or have debugging enabled. For certain classes of issues, it is often hugely helpful to do this.
Here’s also where your traditional armory comes into play: strace, lsof, netstat,
GDB, iotop, valgrind, language profilers (cProfile, xdebug, ruby-prof…).
Those types of tools.
Once you’ve come this far, you rarely end up having to break out profilers or debuggers, though.
strace is often a very good place to start.
You might notice that the application is stuck on a particular read() call
on a socket file descriptor connected to port 3306 somewhere. You’ll know
what to do.
Move on to MySQL and start from the top again. Low hanging
fruit: “Waiting_for * lock”, deadlocks, max_connections.. Move on to: All
queries? Only writes? Only certain tables? Only certain storage
engines?…
You might notice that there’s a connect() to an external API resource that
takes five seconds to complete, or even times out. You’ll know what to do.
You might notice that there are 1000 calls to fstat() and open() on the
same couple of files as part of a circular dependency somewhere. You’ll
know what to do.
It might not be any of those particular things, but I promise you, you’ll notice something.
If you’re only going to take one thing from this section, let it be: learn
to use strace! Really learn it, read the whole man page. Don’t even skip
the HISTORY section. man each syscall you don’t already know what it
does. 98% of troubleshooting sessions end with strace.
However, sometimes you may want (or need) to colour outside the lines, where a cookie-cutter implementation either doesn’t work, or gets in the way.
One such scenario is if you have an application which needs to act both as a web front-end, with your typical cookie-based sessions, and as an API endpoint. Requiring cookies when you’re acting as an API endpoint isn’t particularly nice; tokens in the request header are the way to go! So how can you get Flask sessions to work with both these methods of identification?
Perhaps at this point, I should add that you might be best served by reconsidering your strategy here, and making the API endpoint a distinct application from the one driving your UI. You can still share all your code for your models and logic, and can even make use of a layer 7 load balancer to deal with the separation for you. But be it due to retrofitting, time constraints, legacy or an otherwise imposed design.. here goes;
Since Flask is a pretty lightweight framework, it’s easily extended or wrestled into submission. Luckily for us, it offers a pluggable way to write your own session handling!
I’ve put a small example application with a custom session interface on
GitHub, which allows what we’ve previously discussed. You can distinguish
sessions either by a cookie, if present, or by a header of your choosing (the
cookie trumps the header, if both are present). This header defaults to the
de-facto standard X-Auth-Token in the example, but you can configure this
easily.
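The precedence logic itself is tiny. A sketch of the idea (the cookie name and function here are stand-ins for illustration, not the example app’s actual code):

```python
def resolve_session_id(cookies, headers, header_name="X-Auth-Token"):
    """Pick the session identifier: the cookie wins over the header
    when both are present; otherwise fall back to the header."""
    return cookies.get("session_id") or headers.get(header_name)
```

When neither is present it returns None, which a session interface can treat as “start a fresh session, and decide cookie vs. header from the shape of the request”.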
For ease of use, the datastore used to store the sessions is memcached. But
it’s very easily replaced by any other datastore.
The example is as small and compact as possible while remaining runnable. There are no “bells and whistles” such as actual authentication, that’s for you to handle outside of the session handler. You will also most likely want to extend the error checking and handling.
Do note - there’s a docker-compose file
included in the repository, which will enable you to quickly get up and
running.
Alternatively you can simply run pip install -r requirements.txt && ./runserver.py
from within the app/ directory, provided that you have the required system
dependencies.
Here’s an example of using this session handler with cookies:
(terminal session omitted)
Since we don’t send a JSON body containing the key token, or set the
X-Auth-Token header, the session handler determines the application should send a cookie.
The example has a session timeout of a mighty 30 seconds (configurable, obviously).
Now, if we were to behave like an API, on the other hand:
(terminal session omitted)
As you can see, we don’t get a cookie sent back, because we behaved like an API client. We can also see that we get a brand new session after the 30 seconds has elapsed.
The example also comes with a test suite for verification. You can execute it
by simply running make tests:
(test output omitted)
The tests all run in a docker container, so the first time you run it, you’ll most likely see an image being built, and a memcached image being pulled.
Hope this helps someone!
I’ve neglected this blog a bit in the last year or so. I’ve written a lot of documentation and given a lot of training internally at work, so there hasn’t been an enormous amount of time I’ve been able or willing to spend on it. However;
Five years ago, I wrote an article presenting a script which lets you find what processes have pages in your swap memory, and how much they consume. This article is still by far the most popular one I’ve ever written, and it still sees a fair amount of traffic, so I wanted to write a bit of an updated version, and fill in some of the things I probably should have mentioned more in depth in the original one.
Let’s get one thing out of the way first - the script in the original article
is now redundant. It actually already was redundant in certain cases when I
wrote about it, but now it’s really redundant. To get the same information
in any currently supported distribution, simply launch top, press f, scroll
down to where it says SWAP, press space followed by q. There we go, script
redundant.
With that cleared up - what I didn’t mention five years ago is why you would want to know this. The short answer is; you probably don’t! At least not for the reasons most people seem to want to.
When it comes to swap, the Linux virtual memory manager doesn’t really deal
with ‘programs’. It only deals with pages of memory (commonly 4096 bytes; see
getconf PAGE_SIZE to find out what your system uses).
Instead, when memory is under pressure and the kernel needs to decide what pages
to commit to swap, it will do so according to an LRU (Least Recently Used) algorithm.
Well, that’s a gross oversimplification. There is a lot of magic going on
there which I won’t pretend to know in any greater detail. There are also
mlock() and madvise(), which developers can use to influence these things.
But in essence - the VMM will, amongst other things, deliberately put pages of memory which are infrequently used into swap, to ensure that frequently accessed memory pages are kept in the much faster resident memory.
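The gist of LRU eviction can be sketched in a few lines (a toy model for intuition only, nothing like the kernel’s actual multi-list implementation):

```python
from collections import OrderedDict

class ToyLRU:
    """Toy page cache: touching a page keeps it 'resident';
    the least recently touched page is 'swapped out' when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Access a page; returns the evicted page, or None."""
        if page in self.pages:
            self.pages.move_to_end(page)  # recently used: keep it resident
            return None
        evicted = None
        if len(self.pages) >= self.capacity:
            evicted, _ = self.pages.popitem(last=False)  # LRU goes to "swap"
        self.pages[page] = True
        return evicted
```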
So if you landed on my original article wanting to find out what program was misbehaving by seeing what was in swap, like many appear to have done, please reconsider your troubleshooting strategy! Besides, a memory page being in swap isn’t a problem in and of itself - it’s only when those pages are shifted from RAM onto disk and back that you experience a slowdown. Pages that were swapped out once, and remain there for the lifetime of the process they belong to, don’t contribute to any noticeable issues.
It’s also worth noting that once a page has been swapped out to disk, it will not be swapped back into RAM until the process needs to access it. Therefore, pages being in swap may be an indicator of a previous memory pressure issue, rather than one currently in progress.
So how do you know if you’re “swapping” ?
Most of the time you can just tell, you’ll notice.
But for more modest swapping: the command sar -B 1 20 will print
out some statistics each second for 20 seconds.
If you observe the second and third columns, you will see how many
pages you swap in and out respectively. If these numbers are 0 or near 0, you’re
unlikely to notice any issues due to swapping.
Not everyone has sar installed - so another command you can run is
vmstat 1 20, looking at the si and so columns for Swap In and Swap Out.
So in summary;
top can show you how much swap a process is using.
It was little more than running sar in a loop, extracting some values and
running another command if certain thresholds were exceeded.
Hardly anything that you’d think would result in a reboot.
After whittling down the oneliner to the offending command, it turned out that
sar was the culprit.
Some further debugging revealed that sar merely spawns a process called
sadc, which does the actual heavy lifting.
In certain circumstances, if you send SIGINT (ctrl+c, for example) to sar, it
can exit before sadc has done its thing.
When that happens, sadc becomes an orphan, and /sbin/init, being a good little init system, takes
it under its wing and becomes its parent process.
When sadc receives the SIGINT signal, its signal handler will pass it up to its parent process… You see
where this is going, right?
Yep, /sbin/init gets the signal, and does what it should do. Initiates a reboot.
If you want to reboot an Ubuntu 14.x server, simply run this in a terminal (as root, this is NOT a DoS/vulnerability, merely a bug):
(command omitted)
Rapidly hitting ctrl+c twice does the trick.
Obviously this command doesn’t make sense to run in isolation, but the bug was
hit in the context of a more involved oneliner, and being in a subprocess seems
to trigger it more often.
You may need to run it a couple of times, as a few
things need to line up for it to happen. The above command reboots the server
maybe 8 or 9 times out of 10.
If executed in another subshell, you only need to hit ctrl+c once to trigger it.
A more unrealistic, but sure-fire way to trigger it looks like this:
(command omitted)
Basically, kill sar forcefully (thus orphaning sadc), then send SIGINT to sadc. This has a 100% success rate.
This was fixed upstream in 2014, but Canonical has neglected to backport it.
A colleague of mine, who is a much better OSS citizen than I am, has
raised this with Canonical.
I only tested this on Ubuntu 14.04 and 14.10. Debian and RedHat/CentOS do not appear to suffer from this. It’s surprising that it’s still present in Ubuntu Trusty, since the fix is backported in Debian Jessie.
Only on a Friday afternoon…
So I enabled logging in UFW, and soon started seeing these types of entries
(log entries omitted)
Upon checking which rule actually dropped the packets (iptables -L -nv), it
transpired that the culprit was:
(iptables output omitted)
It turns out that a change in the 3.18 kernel and onwards means
that unless either the nf_conntrack_pptp or nf_conntrack_proto_gre
module is loaded, any GRE packets will be marked as INVALID, as opposed to
NEW and subsequently ESTABLISHED.
So in order to get openvswitch working with UFW, there are two solutions; Either explicitly allow protocol 47, or load one of the aforementioned kernel modules.
Should you go for the former solution, this is the rule you need to beat to the punch:
(iptables rule omitted)
You can do so with -A ufw-before-input -p 47 -i $iface -j ACCEPT.
TL;DR: you can use docker-storage-setup without the root fs being on LVM, by
passing the DEVS and VG environment variables to the script or by editing
/etc/sysconfig/docker-storage-setup.
I stumbled across this article the other day: ‘Friends Don’t Let Friends Run Docker on Loopback in Production’.
I also saw this bug being raised, saying docker-storage-setup doesn’t work with the Fedora 22 cloud image, as the root fs isn’t on LVM.
I decided to try this out, so I created some block storage and a Fedora 22 VM on the Rackspace cloud:
(terminal session omitted)
Once on the machine, I followed the article above:
(terminal session omitted)
And here’s where the bug report I linked earlier comes into play.
docker-storage-setup is just a bash script, and if you just take a look at this
output:
(output omitted)
It sure gives the impression of only doing one single thing - growing the root FS! As the bug rightly points out, the Fedora cloud image doesn’t come with LVM for the root FS (which is a good thing!), so there’s no VG for this script to grow.
So unless you read the script, or the manpage, you wouldn’t necessarily notice
that what --help says is just the default behaviour, and that you can have
docker-storage-setup use an ephemeral disk and leave the root fs alone.
The kicker lies in two environment variables (as opposed to
arguments to the script itself, as is more common): $DEVS and $VG.
If you supply both of those, and the disk you give in DEVS has no partition
table and the VG you supply doesn’t exist, the script will partition the disk
and create all the necessary bits for LVM on that disk:
(terminal session omitted)
So now the script has created the LV thinpool, and written the required docker configuration.
(output omitted)
No trace of /dev/loop0! And to verify that it’s actually using our thinpool:
(output omitted)
In conclusion - the script could definitely do with being updated to use command-line arguments for this, rather than environment variables, and with the --help output updated to highlight this.
Consider the following code (abridged and simplified):
(code listing omitted)
Pretty straightforward - it reads directories, and prints them and the files within them. Now here’s the kicker:
(terminal session omitted)
No traversal?
After a bit of head scratching, and a few debug statements, I found that when using readdir(3) on XFS, dirent->d_type is always 0! No matter what type of file it is. This means that line #25 can never be true.
To be fair though, the manpage states that POSIX only mandates dirent->d_name.
So to be absolutely sure your directory traversal code is more portable, make use of stat(2) and the S_ISDIR() macro!
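The same portability point holds in higher-level languages. Here is a sketch of the stat-based approach in Python, equivalent to calling stat(2) and S_ISDIR() in C rather than trusting dirent d_type:

```python
import os
import stat

def list_subdirs(path):
    """List directory entries that are directories, without relying
    on the readdir d_type field: lstat each entry and check its mode."""
    subdirs = []
    for name in os.listdir(path):
        mode = os.lstat(os.path.join(path, name)).st_mode
        if stat.S_ISDIR(mode):
            subdirs.append(name)
    return sorted(subdirs)
```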
But if you do decide to write an application which takes a password (or any other sensitive information) on the command line, you can prevent other users on the system from easily seeing it like this:
(code listing omitted)
A sample run looks like this:
(terminal session omitted)
In another terminal:
(terminal session omitted)
In the interest of brevity, the above code isn’t very portable - but it works on Linux and hopefully the point comes across. In other environments, such as FreeBSD, you have the setproctitle() function to do the dirty work for you. The key thing here is the overwriting of argv[1]. Because the size of argv[] is allocated when the program starts, you can’t easily obfuscate the length of the password. I say easily - because of course there is a way.
It turns out that the grass actually IS greener.
Tonight I stumbled upon this: a patched version of freetype. For what I assume are political reasons (free as in speech), Fedora ships a freetype version without subpixel rendering. These patches fix that, and other things.
With a default configuration file of 407 lines, it’s quite extensible and configurable as well. Luckily, I quite like the defaults!
If you’re not entirely happy with the way your fonts look on Fedora - it’s well worth a look
]]>They reported back that everything seemed fine, and we went off to do the actual migration. Everything went according to plan and things seemed well. After a while they started seeing some discrepancies in the stock portion of their application. The data didn’t add up with what they expected and stock levels seemed surprisingly high. A crontabbed program was responsible for periodically updating the stock count of products, so this was of course the first place I looked. I ran it manually and looked at its output; it was very verbose and reported some 2000 products had been updated. But looking at the actual DB, this was far from the case.
Still having the test environment available, I ran it a few times against that and could see the com_update
and com_insert counters being incremented, so I knew the queries were making it there. But the data remained intact. At this point, I had a gut feeling what was going on.. so to confirm this, I enabled query logging to see what was actually going on. It didn’t take me long to spot the problem. On the second line of the log, I saw this:
(log line omitted)
The program responsible for updating the stock levels was a Python script using MySQLdb. I couldn’t see any traces of autocommit being set explicitly, so I went on assuming that it was off by default (which turned out to be correct). After adding a commit() call on the connection after the relevant queries had been sent to the server, everything was back to normal as far as stock levels were concerned.
Since the code itself was seeing its own transaction, calls such as cursor.rowcount,
which the testers had relied on, were all correct.
But the lesson here; when testing your software from a database point of view, don’t blindly trust what your code tells you it’s done, make sure it’s actually done it by verifying the data! A lot of things can happen to data between your program and the platters. Its transaction can deadlock and be rolled back, it can be reading cached data, it can get lost in a crashing message queue, etc.
As a rule of thumb, I’m rather against setting a blanket autocommit=1 in code; I’ve seen that come back to haunt developers in the past. I’m a strong advocate for explicit transaction handling.
Have you ever logged in to a server, run free, seen that a bit of swap is used and wondered what’s in there? It’s usually not very indicative of anything, or even overly helpful to know what’s in there; mostly it’s a curiosity thing.
Either way, starting from kernel 2.6.16, we can find out using smaps, which can be found in the proc filesystem. I’ve written a simple bash script which prints out all running processes and their swap usage. It’s quick and dirty, but does the job and can easily be modified to work on any info exposed in /proc/$PID/smaps. If I find the time and inspiration, I might tidy it up and extend it a bit to cover some more alternatives. The output is in kilobytes.
(script omitted)
This will need to be run as root for it to be able to gather accurate numbers. It will still work even if you don’t, but it will report 0 for any processes not owned by your user. Needless to say, it’s Linux only. The output is ordered alphabetically according to your locale (which admittedly isn’t a great thing since we’re dealing with numbers), but you can easily apply your standard shell magic to the output. For instance, to find the process with most swap used, just run the script like so:
(command omitted)
Don’t want to see stuff that’s not using swap at all?
(command omitted)
…; and so on and so forth
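For reference, the per-process total that the script sums from smaps is also exposed directly as the VmSwap field of /proc/$PID/status on newer kernels. A rough Python equivalent (Linux only; values in kilobytes):

```python
import os

def swap_usage_kb():
    """Map pid -> (process name, swap usage in kB), for every
    process whose /proc/<pid>/status we're allowed to read."""
    usage = {}
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(
                    line.split(":", 1) for line in f if ":" in line
                )
        except OSError:
            continue  # process exited, or permission denied
        if "VmSwap" in fields:  # absent for kernel threads
            name = fields["Name"].strip()
            usage[int(pid)] = (name, int(fields["VmSwap"].split()[0]))
    return usage
```

The same caveat applies: run it as root if you want numbers for processes other than your own.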
The only somewhat useful example of using Cassandra with C++ one can find online is this, but due to the API changes, this is now outdated (it’s still worth a read).
So in the hope that nobody else will have to spend the better part of a day piecing things together to achieve even the most basic thing, here’s an example which works with Cassandra 0.7 and Thrift 0.6.
First of all, create a new keyspace and a column family, using cassandra-cli:
(cassandra-cli session omitted)
Now go to the directory where you have Cassandra installed, enter the interface/ directory and run: thrift -gen cpp cassandra.thrift
This will create the gen-cpp/ directory. From this directory, you need to copy all files bar the Cassandra_server.skeleton.cpp one to wherever you intend to keep your sources.
Here’s some example code which inserts, retrieves, updates, retrieves and deletes keys:
(code listing omitted)
Say we’ve called the file cassandra_example.cpp, and you have the files mentioned above in the same directory; you can compile things like this:
(compilation commands omitted)
Another thing worth mentioning is Padraig O'Sullivan’s libcassandra, which may or may not be worth a look depending on what you want to do and what versions of Thrift and Cassandra you’re tied to.
So far so good! A week or so later, we often get the call “Our page load time is higher now than before the upgrade! We’ve got twice as much hardware, and it’s slower! You have broken it!” It’s easy to see where they’re coming from. It makes sense, right?
That is until you factor in the newly introduced network topology! Today it’s not unusual (that’s not to say it’s acceptable or optimal) for your average wordpress/drupal/joomla/otherspawnofsatan site to run 40-50 queries per page load. Quite often even more!
Based on a tcpdump session of a reasonably average query (if there is such a thing), connecting to a server, authenticating, sending a query and receiving a 5 row result set of 1434 bytes yields 25 packets being sent between my laptop and a remote DB server on the same wired, non-congested network. A normal, average latency of TCP/IP over Ethernet is ~0.2 ms for the size of packets we’re talking here.
So, doing the maths, you’re seeing 25*0.2*50=250ms in just network latency per page load for your SQL queries. This is obviously a lot more than you see over a local UNIX socket.
This is inevitable; laws of physics. There is nothing you, your sysadmin and/or your hosting company can do about it. There may however be something your developer can do about the amount of queries! You also shouldn’t confuse response times with availability. Your response times may be slower, but you can (hopefully) serve a lot more users with this setup!
Sure, there are technologies out there which have considerably less latency than ethernet, but they come with quite the price-tag, and there are more often than not quite a few avenues to go down before it makes sense to start looking at that kind of thing.
You could also potentially look at running the full stack on both machines, using master/master replication for your DBs, load balancing your front-ends and having them both read locally but only write to one node at a time! That kind of DB scenario is fairly easily set up using mmm for MySQL. But in my experience, this often ends up more costly and potentially introduces more complexities than it solves. I’m an avid advocate for keeping server roles separate as much as possible!
This mode of replication is called semi synchronous due to the fact that it only guarantees that at least one of the slaves has written the transaction to disk in its relay log, not actually committed it to its data files. It guarantees that the data exists by some means somewhere, but not that it’s retrievable through a MySQL client.
Semi sync is available as a plugin, and if you compile from source, you’ll need to do --with-plugins=semisync…
So far, the semisync plugin can only be built as a dynamic module, so you’ll need to install it once you’ve got your instance up and running. To do this, you do as with any other plugin:
(SQL statements omitted)
You might get a 1126 error and a message saying “Can’t open shared library..”; in that case you most likely need to set the plugin_dir variable in my.cnf and give MySQL a restart. If you’re using a master/slave pair, you obviously won’t need to load both modules as above: you load the slave one on your slave, and the master one on your master. Once you’ve done this, you’ll have entries for these modules in the mysql.plugin table. When you have confirmed that you do, you can safely add the pertinent variables to your my.cnf. The values I used (in addition to the normal replication settings) for my master/master sandboxes were:
(configuration omitted)
Note that you probably won’t want to use these values for _trace_level in production due to the verbosity in the log! I just enabled these while testing. Also note that the timeout is in milliseconds. You can also set these on the fly with SET GLOBAL (thanks Oracle!), just make sure the slave is stopped before doing this, as it needs to be enabled during the handshake with the master for the semisync to kick in.
The timeout is the amount of time the master will block, waiting for a slave to acknowledge the write, before giving up on the whole idea of semi synchronous operation and continuing as normal. If you want to monitor this, you can use the status variable Rpl_semi_sync_master_status, which is set to Off when this happens. If you want to avoid this condition altogether, you’ll need a large enough timeout and a low enough monitoring threshold, as there doesn’t seem to be a way to force MySQL to wait forever for a slave to appear.
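Checking it from a client is straightforward; these are the stock status variable names:

```sql
-- "On" while semi-sync is active on the master, "Off" after a timeout:
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
-- Commits that were never acknowledged by any slave:
SHOW STATUS LIKE 'Rpl_semi_sync_master_no_tx';
```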
If you’re running an automated failover setup, you’ll want to set the timeout higher than your heartbeat interval, thus ensuring no committed data is lost. You might also want to set the timeout considerably lower initially on the passive master, so that you don’t end up waiting on the master you know is unhealthy and have just failed over from.
Before implementing this in production, I would strongly recommend running a few performance tests against your setup, as this will slow things down considerably for some workloads. Each transaction has to be written to the binlog, read over the wire and written to the relay log, and then lastly flushed to disk before each DML statement returns. You will almost certainly benefit from batching queries into larger transactions rather than using the default autocommit mode, as this reduces how often those steps have to happen. Update: even though the manual clearly states that the event has to be flushed to disk, this doesn’t actually appear to be the case (see comments). The above still stands, but the impact may not be as great as first thought.
When I find the time, I will run some benchmarks on this.
Lastly, please note that this is written while MySQL 5.5 is still in release candidate stage, so while unlikely, things are subject to change. So please be mindful of this in future comments.
]]>I got most of it set up, and got started on writing up the glusterfs Puppet module. Fairly straightforward: a few directories, configuration files and a mount point. Then I came to the Service declaration, and of course we want this to be running at all times, so I went on and wrote:
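Something to this effect (a sketch; the resource details are assumptions):

```puppet
service { 'glusterfsd':
  ensure    => running,
  enable    => true,
  hasstatus => true,   # rely on "service glusterfsd status"
  require   => Package['glusterfs-server'],
}
```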
expecting glusterfsd to be running shortly after I purposely stopped it. But it wasn’t. So I dove into Puppet (yay Ruby!) and deduced that the way it determines whether something is running or not is the return code of: /sbin/service servicename status
So a quick look in the init script which ships with glusterfs-server shows that it calls the stock init function “status” on glusterfsd, which is perfectly fine, but then it doesn’t exit with the return code from this function; it simply runs out of scope and exits with the default value of 0.
So to get around this, I made a quick change to the init script: I captured the return code from the “status” function (defined in /etc/rc.d/init.d/functions on RHEL5) and exited with $?, and Puppet had glusterfsd running within minutes.
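The underlying shell behaviour is easy to demonstrate in isolation; a minimal sketch (status_func stands in for the stock status helper):

```shell
#!/bin/sh
# status_func stands in for the init "status" helper; it reports failure.
status_func() { return 3; }

status_func
first=$?    # 3: the code is only available immediately afterwards
true        # any subsequent command resets $?
second=$?   # 0: the failure has been masked, and a plain "run off the end"
            # exit would report success to callers like Puppet
echo "captured=$first masked=$second"
```

Hence the fix: capture $? right after calling status, and hand it to exit.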
I couldn’t find anything when searching for this, so I thought I’d make a note of it here.
]]>
4 is obviously quite a high score for an email whose only flaw is being in HTML. But FH_DATE_PAST_20XX caught my eye in all of the outputs. So to the rule files:
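From memory, the offending rule in the old ruleset looked essentially like this (approximate, but the if-unset trick and the 20[1-9][0-9] match are the heart of it):

```
header   FH_DATE_PAST_20XX   Date =~ /20[1-9][0-9]/ [if-unset: 2006]
describe FH_DATE_PAST_20XX   The date is grossly in the future.
```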
Aha. This is a problem. With 50_scores.cf containing this:
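The scores assigned there were in the three-point range; illustratively (the exact per-scoreset values varied between releases):

```
score FH_DATE_PAST_20XX 2.075 3.384 3.554 3.188 # illustrative values
```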
it’s no wonder emails are getting dropped! I guess this is the kind of problem you can expect when running a distribution with six-year-old packages and neglecting to update the rules frequently (or at least every once in a while)!
Luckily, this rule is gone altogether from RHEL6’s version of spamassassin.
]]>It’s relatively simple to set up, and configuration can be done in two different ways. You can use the supplied cgset command, or if you’re accustomed to doing it the usual way when dealing with kernel settings, you can simply echo values into the pseudo-files under the control group.
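The two styles side by side (cgroup v1 paths as on RHEL6; the mount point and group name are assumptions):

```shell
# using the libcgroup tool:
cgset -r memory.limit_in_bytes=512M group1
# or the traditional way, straight into the pseudo-filesystem:
echo 512M > /cgroup/memory/group1/memory.limit_in_bytes
```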
Here’s a control group in action:
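A session to that effect (a sketch from memory of the setup, not the original listing; the group and binary names are taken from the text):

```
$ cat /proc/self/cgroup
...:memory:/group1
$ cat /cgroup/memory/group1/memory.limit_in_bytes
536870912
$ ./alloc
Killed
$ echo 1024M > /cgroup/memory/group1/memory.limit_in_bytes
$ ./alloc
$
```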
The first line shows that the shell which launches the app is under the control of the cgroup group1, so subsequently all its child processes are subject to the same restrictions.
As you can also see, the initial memory limit in the group is 512M. alloc is a simple C app I wrote which calloc()s 612M of RAM (for demonstrative purposes, I’ve disabled swap on the system altogether). On the first run, the kernel kills the process in the same way it would if the whole system had run out of memory, and the kernel log indicates that it was the control group that ran out of memory, not the system as a whole.
Unfortunately it doesn’t indicate which cgroup the process belonged to. Maybe it should?
cgroups doesn’t just give you the ability to limit the amount of RAM; it has a lot of tuneables. You can even set swappiness on a per-group basis! You can limit the devices applications are allowed to access, you can freeze processes, and you can tag outgoing network packets with a class ID in case you want to do shaping or profiling on your network! Perfect if you want to prioritise SSH traffic over everything else, so you can work comfortably even when your uplink is saturated. Furthermore, you can easily get an overview of memory usage, CPU accounting etc. of applications in any given group.
All this means you can clearly separate resources and, to quite a large extent, ensure that applications won’t starve the whole system, or each other, of resources. Very handy: no more waiting half an hour for the swap to fill up and the OOM killer to kick in (often choosing the wrong PID) when customers’ applications have run astray.
A much welcomed addition to RHEL!
]]>This turned out to be a real head scratcher for me, and initially I thought the problem was something else as Xen wasn’t being very helpful with error messages.
Hopefully there’ll be an update for this soon!
]]>Unfortunately I set about this task on an RHEL 5.4 box, and it hasn’t been a walk in the park. Quite a few dependencies were out of date or missing from the repositories: libicu, boost, onig, tbb etc.
CMake did a good job of telling me what was wrong, though, so it wasn’t a huge deal; I just compiled the missing pieces from source and put them in $CMAKE_PREFIX_PATH. One thing CMake didn’t pick up on, however, was that the flex version shipped with current RHEL is rather outdated. Once I thought I had everything configured, I set about the compilation, and my joy was swiftly cut short when make aborted in a flex step.
Not entirely sure what flex was actually doing at that point, I took the shortcut of replacing /usr/bin/flex with a shell script which simply exited after dumping $@ into a file in /tmp/, and re-ran make. The argument list captured in that file looked quite valid to me, and there was certainly no single - anywhere in the command line.
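The stand-in script itself is trivial; a sketch (the log path is arbitrary):

```shell
#!/bin/sh
# Hypothetical stand-in for /usr/bin/flex: record how we were invoked
# and produce nothing, so the build stops at this step.
echo "$@" >> /tmp/flex-args.log
```

Cheap and cheerful, but it beats guessing at what the build system is passing around.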
Long story short: flex introduced --header-file in a relatively “recent” version (2.5.33 it seems, but I may be wrong on that one; it doesn’t matter). Unlike most other programs (which use getopt), it won’t tell you Invalid option ‘--header-file’. So after compiling a newer version of flex, I was sailing again.
So here’s a small tip: ensure that your developers know what they are doing! It will save you a lot of hassle and money in the long run.
Without having made a science out of it, I can confidently say that at the very least 95% of the downtime I see on a daily basis is due to faulty code in the applications running on our servers.
So after you’ve demanded dual power feeds to your rack, bonded NICs and a gazillion physical paths to your dual controller SAN, it would make sense to apply the same attitude towards your developers. After all, they are carbon based humans and are far more likely to break than your silicon NIC. Now unfortunately it is not as simple as “if I pay someone a lot of money and let them do their thing, I will get good solid code out of it”, so a great deal of due diligence is required in this part of your environment as well. I have seen more plain stupid things coming from 50k pa. people than I care to mention, and I have seen plain brilliant things coming out of college kids’ basements.
This is important not only from an availability point of view; it’s also about running cost. The amount of hardware in our data centers which is completely redundant, and could easily be made obsolete with a bit of code and database tweaking, is frightening. So you think you’ve cut a great deal when someone says they can build your e-commerce system in 3 months for 10k less than anyone else quoted you. But in actual fact, all you’ve done is pay someone to effectively re-brand a bloated, way too generic, stock framework/product which the developer has very little insight into and control over. Yes, it works: if you “click here, there and then that button”, the right thing does appear on the screen. But only after executing hundreds of SQL queries, looking for your session in three different places, doing four HTTP redirects, reading five config files and including 45 other source files. Needless to say, that one-off 10k you think you’ve saved will be swallowed by recurring hardware cost in no time. You have probably also severely limited your ability to scale things up in the future.
So in summary, don’t cheap out on your development but at the same time don’t think that throwing money at people will make them write good code. Ask someone else to look things over every now and then, even if it will cost you a little bit. Use the budget you were planning on spending on the SEO consultant. Let it take time.
]]>