Sysadmin #1 - Things never work first time, or the second time...

29 April 2018 10:00 pm

I started my journey into playing Linux systems admin sometime in 1997 when I bought a 486 PC to use as a firewall. I remember the event reasonably clearly; I made the deliberate choice to spend some money on a second-hand 486, with the intention of using it just to run Linux and manage my dialup modem connection, so it could be shared between my main PC and an old PC my dad was using. Mostly though it was because networking computers is cool, and having two means it's twice as cool, right?

The PC in the middle was my firewall. I think it ran Slackware Linux and a lot of hand-crafted IP masquerading magic.

Back then there was no Google; we were still in the days of URLs that looked like this: http://www.users.globalnet.co.uk/~kermit/ (which was my first ever website. Bizarrely it still sort of exists, and still contains the last data I put into it, even though I stopped using that ISP about 20 years ago! Follow the chains of redirection to get back here!). Finding anything online required knowing where to go, and it was quite difficult. Mailing lists ruled the day - if you wanted to know something, you found a mailing list and joined it. Imagine StackOverflow, but by email. It even had spam! This isn't some Monty Python sketch, it really was like that. Finding information took time and an understanding of boolean operators.

Of course, if the thing you're trying to learn is "how do I make my Linux box dial the Internet", and going on the Internet costs you 5p per minute and dribbles down the phone line at 33.6 kbit/s, you don't spend much time online reading. Fortunately the Linux distributions contained lots of useful HOWTO documents on the CDs they came on: straightforward articles and lists of instructions telling you how to do specific things. Want to set up a three button mouse? How about beating Sendmail into submission? There was a HOWTO for most things, and all an aspiring systems admin had to do was read and follow, type and read, maybe read the thing first before doing it, or just blindly follow the instructions and hope nothing went wrong. My 17-year-old self loved it. I learn by reading and doing, and it felt like an extension of diligently copying out BASIC listings on my ZX Spectrum back in the 80s. Except instead of some crap appearing on a screen I got The Internet in my bedroom.

The thing is, it's all well and good following a HOWTO if the thing you're trying to do is common and well known. It's not so easy when you want to do something a bit different. That required lots of lateral thinking and many late nights trying things repeatedly until they worked.

It's this. Every. Single. Damn. Time.

Sometimes I wonder if some poor sod's last ever words were "I'll check on my machine when I get home"...

So like Grandpa Simpson, I'll just curve back round to the topic of this post and try to focus my attention on what I've done this week. Stick with it, I'm going somewhere here.

As I said before, in my cellar is a PC running Linux that is my server. It hosts this website, another one and a bunch of internal services, and is also responsible for consuming half my electricity bill per month. There's also a NAS down there that, in addition to serving a few terabytes of storage, produces one of the most irritating vibrating noises known to man - cheap Chinese ball bearing fans are not worth buying.

The server used to just have Linux on it, and then a variety of programs installed on the top - Apache, MySQL, Postfix and the like. It's the way I've always done it since those days back in the late 90s when our PCs were about as powerful as a Raspberry Pi. Virtualisation wasn't a thing on home PCs, and my modern PC's CPU die contains more RAM than my main PC did. Having everything installed on one machine made a bit of a mess, and things were prone to breaking. It was the software equivalent of a carefully balanced shelf - try to move one thing and you risk knocking over many other things.

This is a visual metaphor for a Linux server

About a year or so ago I discovered the Linux Containers (LXC) project, and spent a fair amount of time over several late nights splitting my mess of a server into separate instances. For the first time ever I could watch Apache break down and not have it take out my mail. It was an entirely new concept to learn and understand, and over the past year it has proven itself to be almost as useful as the cheap UPS I bought. Segmenting the system up made it more robust and less prone to failing entirely. Upgrading software was no longer a game of chance, and I no longer feared rebooting my server.

Recently I got a bit itchy about a lack of backups of my carefully crafted system. I was also a bit bothered that I'm unable to recreate this system from scratch - it's getting too complicated to manage. I honestly can't remember how I set my mail server up; it does many clever things and I've forgotten what I did. I need a system that will work perfectly at 1am on a Sunday when I receive complaints of "Netflix has stopped working! Why does our stuff have to always break?". Somehow I came across Proxmox, which is one of those open source projects that has a subscription business model hiding inside it. Fortunately the subscription part is just for tech support and not aimed at someone with a PC in his cellar trying to make the Internet as complicated as possible.

Look at this, not only does it have lists of interesting menu items down the side, but it has charts and graphs! Underneath is the same LXC stuff that I've used before, there's just a nice management interface stuck to the top. Graphs are nice, they help answer the question of "why is everything so slow?".

Also, it lets you create a cluster of nodes.

The LXC containers can be migrated between the nodes, meaning physical hardware stops being an issue. So to me this is networking (cool), virtualised machines (really cool), that can be moved between physical machines (space age cool). And it runs off a cheap PC I bought for £150 (in 1999 I could buy a 486 for £150. Today I can buy a Core i5 PC for the same price).

So my setup is now the PC in the cellar running Proxmox, and my media PC under the telly also running Proxmox. Both connect via NFS to my NAS to do backups and access shared data. The media PC is currently on and running, serving some parts of my network, while my server is still running LXC and serving the other things that I haven't migrated over yet. Eventually everything will run on the server in my cellar and the media PC will only boot Proxmox if it's needed as a backup. For the rest of the time the media PC will boot Kodi from a USB stick or remain switched off.

All of this sounds quite nice, doesn't it? I have a mini cluster of two computers, and backups are built into the software and can be scheduled to happen automatically. There's one-click restoration of backups, and the system generally feels robust and like it's going to work well. One day there'll be a hardware failure and I'll wonder why my alarm didn't go off in the morning, attempt to view the web control panel from my tablet and go "huh... I have no wifi wtf?" and after standing in the cellar for a few minutes go and turn the backup server on, and start looking around the house for a spare hard disk (or eBay for an entirely new PC). Which is much better than trying to bodge together enough bits of router config to make DNS continue working while I attempt to Google for the source of the problem. Sometimes the HDDs in my NAS die, but it's an annoyance rather than a show-stopping disaster of lost data. Last time it happened, it took me a month to realise. I just went to the shop and bought a new drive.

None of this came as easily as it appears. Not a single bit. Every step has several hours of torturous Google searching hidden behind it. I had to tackle everything from the farcical (the non-booting USB drive, because I wrote it using the wrong software, and "why is syslog empty... oh because the permissions are wrong") to the abstract and weird. NFS crashed the kernel, and the PC sat and dribbled at me while the JavaScript management software running in my browser pretended everything was just fine. At one point I broke everything because DNS stopped working, because my internal DNS is held in a MySQL database which had just shat the bed. Fixing this required remembering that "1.1.1.1" is a public DNS resolver, random Googling and clicking and then some copy-pasting of SQL into the MySQL console to recreate some missing tables.

All because I'd dared to install a newer version of the software than what I was using previously.

When I first set this DNS up it took me a while to realise I couldn't refer to the MySQL server by hostname: the DNS server can't resolve the name until it's running, but to run it needs to reach MySQL. A chicken-and-egg error. Stuff like this keeps you awake at night. I wonder what normal people do on a Sunday evening... poking at Linux probably isn't high on their list of things to do.
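The chicken-and-egg trap can be caught before it bites. Here is a minimal Python sketch, assuming the DNS server's database host sits in a config value you can read as a string. The function names and the example hosts are made up for illustration and aren't taken from any real DNS server's configuration.

```python
import ipaddress


def is_ip_literal(host):
    """True if host is a literal IPv4/IPv6 address, so no DNS lookup is needed."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def check_backend_host(host):
    """Return a warning if the DNS server's own database host would need DNS."""
    if is_ip_literal(host):
        return None
    return (f"'{host}' is a hostname: resolving it needs the DNS server "
            "that is still waiting for this database")


if __name__ == "__main__":
    # Hypothetical values: one literal address, one name that needs resolving.
    for candidate in ("192.0.2.10", "mysql.internal.example"):
        print(candidate, "->", check_backend_host(candidate) or "fine")
```

Pointing the backend at a literal address (or a static hosts file entry) sidesteps the loop, at the cost of having to remember it's there when the database moves.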

Finding all of the answers to these problems wasn't straightforward. My usual tactic is this:

  1. Turn on debug logging
  2. Find the log
  3. Read the log, looking for key words that I've learned over the decades
  4. Put those words in Google
  5. Use years of intuition to weed out the garbage from the worthwhile
  6. Wait for luck to hit, usually somewhere on page 3 or 6 of some random forum
  7. Find the one line of command line code that fixes the issue
  8. Repeat.
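Step 3 is the only part of that list a computer can really help with. A rough Python sketch of it, assuming the log is plain text: the keyword list and the sample log lines are invented for the example, and the real skill is still knowing which words matter.

```python
import re

# Hypothetical keyword list; swap in whatever words you've learned to distrust.
KEYWORDS = ("error", "failed", "denied", "timeout", "refused")


def suspicious_lines(log_text, keywords=KEYWORDS):
    """Return (line_number, line) pairs for lines mentioning any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    # Made-up log excerpt, not output from any real program.
    sample = """\
demo-service[1]: starting up
demo-service[1]: connection to database refused
demo-service[1]: listening on port 8080
demo-service[1]: Permission denied opening data file
"""
    for number, line in suspicious_lines(sample):
        print(f"{number}: {line}")
```

Whatever falls out of this is what goes into the search box in step 4.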

It seems that every single time I have an issue, the answer does exist. It is catalogued on Google somewhere, but it's rarely on an obvious site you'd go to first. Somehow I can skim read through the pages of people asking very similar questions, and pick out the plausible "try typing this into the console" answers from the garbage. Even more weirdly, I think I quite like the challenge of fixing the problems. It's like a puzzle that never gets boring, and never repeats itself. Steam is full of programming-style games to play, but there's never much consequence to failing those. If I bugger this up, I can't get email, and important people like to send me emails with important things in them - like "we need your gas meter reading" or "your car tax needs paying".

Yeah, I could ditch all this and have my ISP-provided router, a cheap set-top box, a small Synology NAS and maybe an Amazon instance for my web stuff... but that'd be boring. Just like this morning could have been boring, had the LXC instance managing my music player understood that the timezone was GMT+1. Instead I woke up an hour too late.

Oh, if you're a random person looking for the correct way to change the timezone on your Ubuntu Server, the magic incantation is sudo dpkg-reconfigure tzdata
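For anyone wondering how a wrong timezone turns into an extra hour in bed, here is a small Python sketch of the arithmetic. It assumes the container thought it was on UTC while the house was on GMT+1, and the 07:00 alarm time is a guess made purely for the example.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
BST = timezone(timedelta(hours=1), "BST")  # GMT+1, modelled as a fixed offset


def actual_alarm_time(year, month, day, hour, minute, container_tz, local_tz):
    """Wall-clock time at which an alarm really fires.

    The alarm is set as hour:minute, the container interprets that in
    container_tz, and the person in bed experiences it in local_tz.
    """
    fires_at = datetime(year, month, day, hour, minute, tzinfo=container_tz)
    return fires_at.astimezone(local_tz)


if __name__ == "__main__":
    # Hypothetical 07:00 alarm on the morning of the post.
    wrong = actual_alarm_time(2018, 4, 29, 7, 0, UTC, BST)
    right = actual_alarm_time(2018, 4, 29, 7, 0, BST, BST)
    print("container on UTC:", wrong.strftime("%H:%M"))
    print("container on BST:", right.strftime("%H:%M"))
```

With the container on UTC the 07:00 alarm goes off at 08:00 by the bedside clock; once the container agrees it is on GMT+1, it goes off at 07:00.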