Sysadmin 101: Troubleshooting

I typically keep this blog strictly technical, keeping observations, opinions and the like to a minimum. But this and the next few posts will be about the basics and fundamentals of starting out in system administration/SRE/system engineer/sysops/devops-ops (whatever you want to call yourself) roles more generally.
Bear with me!

“My web site is slow”

I picked the type of issue for this article at random; the approach can be applied to pretty much any sysadmin-related troubleshooting. It’s not about showing off the cleverest oneliners to find the most information. It’s also not an exhaustive, step-by-step “flowchart” with the word “profit” in the last box. It’s about general approach, by means of a few examples.
The example scenarios are solely for illustrative purposes. They sometimes rest on assumptions that don’t apply to all cases all of the time, and I’m positive many readers will go “oh, but I think you will find…” at some point.
But that would be missing the point.

Having worked in support, or within a support organization, for over a decade, there is one thing that strikes me time and time again and that made me write this:
The instinctive reaction many techs have when facing a problem is to start throwing potential solutions at it.

Read on →
Feb 16th, 2017

Flask - Cookie and Token Sessions Simultaneously

Dealing with sessions in Flask applications is rather simple! There is plenty of choice in pre-rolled implementations that are more or less plug-and-play.

However, sometimes you may want (or need) to colour outside the lines, where a cookie-cutter implementation either doesn’t work, or gets in the way.

One such scenario is if you have an application which needs to act both as a web front-end, with your typical cookie-based sessions, and as an API endpoint. Requiring cookies when you’re acting as an API endpoint isn’t particularly nice; tokens in the request header are the way to go! So how can you get Flask sessions to work with both these methods of identification?

Read on →
Jan 6th, 2017

Swap Usage - 5 Years Later

Skip to the end for a TL;DR

I’ve neglected this blog a bit in the last year or so. I’ve written a lot of documentation and given a lot of training internally at work, so there hasn’t been an enormous amount of time I’ve been able or willing to spend on it. However;

Five years ago, I wrote an article presenting a script which lets you find what processes have pages in your swap memory, and how much they consume. This article is still by far the most popular one I’ve ever written, and it still sees a fair amount of traffic, so I wanted to write a bit of an updated version, and fill in some of the things I probably should have mentioned more in depth in the original one.

Let’s get one thing out of the way first - the script in the original article is now redundant. It actually already was redundant in certain cases when I wrote about it, but now it’s really redundant. To get the same information in any currently supported distribution, simply launch top, press f, scroll down to where it says SWAP, press space, and then press q. There we go, script redundant.

With that cleared up - what I didn’t mention five years ago is why you would want to know this. The short answer is; you probably don’t! At least not for the reasons most people seem to want to.

When it comes to swap, the Linux virtual memory manager doesn’t really deal with ‘programs’. It only deals with pages of memory (commonly 4096 bytes, see getconf PAGE_SIZE to find out what your system uses).

Instead, when memory is under pressure and the kernel needs to decide what pages to commit to swap, it will do so according to an LRU (Least Recently Used) algorithm. Well, that’s a gross oversimplification. There is a lot of magic going on there which I won’t pretend to know in any greater detail. There are also mlock() and madvise(), which developers can use to influence these things.

But in essence - the VMM will, amongst other things, deliberately move infrequently used pages of memory to swap, to ensure that frequently accessed pages are kept in the much faster resident memory.

So if you landed on my original article, wanting to find out what program was misbehaving by seeing what was in swap, like many appear to have done, please reconsider your troubleshooting strategy! Besides, a memory page being in swap isn’t a problem in and of itself - it’s only when those pages are shifted from RAM onto disk and back that you experience a slowdown. Pages that were swapped out once, and remain there for the lifetime of the process they belong to, don’t contribute to any noticeable issues.

It’s also worth noting that once a page has been swapped out to disk, it will not be swapped back into RAM again until the process needs to access it. Therefore, pages being in swap may be an indicator of a previous memory pressure issue, rather than one currently in progress.

So how do you know if you’re “swapping”?

Most of the time you can just tell; you’ll notice. But for more modest swapping, the command sar -W 1 20 will print swapping statistics each second for 20 seconds. If you observe the second and third columns (pswpin/s and pswpout/s), you will see how many pages you swap in and out respectively. If these numbers are 0 or near 0, you’re unlikely to notice any issues due to swapping. Not everyone has sar installed - so another command you can run is vmstat 1 20 and look at the si and so columns for Swap In and Swap Out.

So in summary;

  • top can show you how much swap a process is using
  • A process using swap isn’t necessarily (in fact, is rarely) a badly behaved process
  • Swap isn’t inherently bad, it’s only bad when it’s used frequently
  • The presence of pages in swap doesn’t necessarily indicate a current memory resource issue
Dec 19th, 2016

Sar Rebooting Ubuntu

Today I had a colleague approach me about a oneliner I sent him many months ago, saying that it kept rebooting a server he was running it on.

It was little more than running sar in a loop, extracting some values and running another command if certain thresholds were exceeded. Hardly anything that you’d think would result in a reboot.

After whittling down the oneliner to the offending command, it turned out that sar was the culprit. Some further debugging revealed that sar merely spawns a process called sadc, which does the actual heavy lifting.

In certain circumstances, if you send SIGINT (ctrl+c, for example) to sar, it can exit before sadc has done its thing.
When that happens, sadc becomes an orphan, and /sbin/init, being a good little init system, takes it under its wing and becomes its parent process.

When sadc receives the SIGINT signal, its signal handler will pass it up to its parent process… You see where this is going, right?
Yep, /sbin/init gets the signal and does what it should do: it initiates a reboot.

Read on →
Oct 30th, 2015

GRE Tunnels and UFW

Today I wrote an Ansible playbook to set up an environment for a docker demo I will be giving shortly. In the demo I will be using three hosts, and I want the containers to be able to speak to each other across hosts. To this end, I’m using Open vSwitch. The setup is quite straightforward: set up the bridge, get the meshed GRE tunnels up and off you go.
I first set this up in a lab, with firewalls disabled. But knowing that I will give the demo on public infrastructure, I still wrote the play to allow everything on a particular interface (an isolated cloud-network) through UFW.
When I ran my playbook against a few cloud servers, the containers couldn’t talk to each other on account of the GRE tunnels not working.

Read on →
Sep 14th, 2015

LVM Thinpool for Docker Storage on Fedora 22

TL;DR: You can use docker-storage-setup without root fs being on LVM by passing DEVS and VG environment variables to the script or editing /etc/sysconfig/docker-storage-setup

I stumbled across this article the other day: ‘Friends Don’t Let Friends Run Docker on Loopback in Production’

I also saw this bug being raised, saying docker-storage-setup doesn’t work with the Fedora 22 cloud image, as the root fs isn’t on LVM.

Read on →
Sep 8th, 2015

Readdir and Directories on Xfs

Recently I had some pretty unexpected results from a piece of code I wrote quite a while ago, and never had any issues with. I ran my program on a brand new CentOS 7 installation, and the results weren’t at all what I was used to!

Consider the following code (abridged and simplified):

readdir_xfs.c
#include <dirent.h>
#include <limits.h>   /* PATH_MAX */
#include <stdio.h>
#include <string.h>   /* strcmp */
#include <sys/types.h>

/* Walk `path` recursively, printing every directory and file found. */
void recursive_dir(const char *path){

  DIR *dir;
  struct dirent *de;

  if (!(dir = opendir(path))){
    perror("opendir");
    return;
  }
  if (!(de = readdir(dir))){
    perror("readdir");
    closedir(dir);
    return;
  }

  do {

    /* Skip the entries for the current and the parent directory. */
    if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) {
      continue;
    }

    /* File or directory is decided purely from the d_type field. */
    if (de->d_type == DT_DIR){
      char full_path[PATH_MAX];
      snprintf(full_path, PATH_MAX, "%s/%s", path, de->d_name);
      printf("Dir: %s\n", full_path);
      recursive_dir(full_path);
    }
    else {
      printf("\tFile: %s/%s\n", path, de->d_name);
    }
  } while ((de = readdir(dir)));
  closedir(dir);

}

int main(int argc, char *argv[]){

  if (argc < 2){
    fprintf(stderr, "Usage: %s <dir>\n", argv[0]);
    return 1;
  }

  recursive_dir(argv[1]);
  return 0;
}
Read on →
Jan 16th, 2015

How Does MySQL Hide the Command Line Password in Ps?

I saw this question asked today, and thought I’d write a quick post about it. Giving passwords on the command line isn’t necessarily a fantastic idea - but you can sort of see where they’re coming from. Configuration files and environment variables are better, but just slightly. Security is a nightmare!

But if you do decide to write an application which takes a password (or any other sensitive information) on the command line, you can prevent other users on the system from easily seeing it like this:

Read on →
Mar 10th, 2012

Font Rendering - No More Jealousy

I suppose this kind of content is what most people use twitter for these days. But since I’ve remained strong and stayed well away from that, I suppose I will have to be a tad retro and write a short blog post about it. If you, like me, are an avid Fedora user, I’m sure you’ve thrown glances at colleagues’ or friends’ Ubuntu machines and thought that there was something slightly different about the way they looked (aside from the obvious Gnome vs Unity differences). Shinier somehow… So had I, but I mainly dismissed it as a case of “the grass is always greener…”.

It turns out that the grass actually IS greener.

Read on →
Feb 28th, 2012

Transactions and Code Testing

A little while ago I worked with a customer to migrate their DB from using MyISAM to InnoDB (something I definitely don’t mind doing!). I set up a smaller test instance with all tables using the InnoDB engine as part of the testing. I instructed them to thoroughly test their application against this test instance and let me know if they identified any issues.

They reported back that everything seemed fine, and we went off to do the actual migration. Everything went according to plan and things seemed well. After a while they started seeing some discrepancies in the stock portion of their application. The data didn’t add up with what they expected and stock levels seemed surprisingly high. A crontabbed program was responsible for periodically updating the stock count of products, so this was of course the first place I looked. I ran it manually and looked at its output; it was very verbose and reported some 2000 products had been updated. But looking at the actual DB, this was far from the case.

Read on →
Aug 18th, 2011