Day in the life of a Systems Administrator

Day in the life of a Unix Systems Administrator

Wow, been almost a year since I blogged anything. I’m getting lazy.

So what’s the daily life of a systems administrator like? Here was today:

Plan coming in in the morning: Begin quarterly “Vulnerability audit report”.

What did I do?
Windows server starts alerting on CPU at midnight, again. We fixed the problem on Tues. Why is it alerting again?
Of course it corrects itself before I can get logged in and doesn’t go off again all day. Send email to person responsible for the application on that server to ask if the app was running any unusually CPU-intensive jobs. Follow up with screenshot showing times CPU alerts went off. Get response of “nothing unusual”. As usual.

We updated the root password on all Unix servers last week. Get a list of 44 systems from coworker that still have the old root password.
Check the list, confirm all still have old root password.
Check the list against systems that were updated via Ansible. All on the Ansible list. No failures when running the Ansible playbook to update the root password. All spot-checks that the new root password was in effect at the time showed task was working as expected.
Begin investigating why these systems still have the old root password.
Speculation during team scrum that Puppet might be resetting the root password.
Begin testing hypothesis that root password was, in fact, changed, but something else is re-setting it back to the old password.
Manually update root password on one host. Monitor /etc/shadow to see if it changes again after setting password. (watch -d ls -l /etc/shadow)
Wait some more.
Wait 27 minutes, BOOM! /etc/shadow gets touched.
Investigate to see if Puppet is the culprit. I know nothing about Puppet. I’m an Ansible guy. The puppet guy (who knows just enough to have set up the server and built some manifests and get Puppet to update root the last time the root password was changed, before I started working here.) is out today.
Look at log files in /var/log. Look at files in /etc/puppet on puppet server. Try to find anything that mentions “passw(or)?d&&root” (did I mention I’m not a puppet guy?). Find a manifest that says something about setting the root password, but it references a variable. Can’t find where the value of that variable is set.
Look some more at the target host. See in log files that it’s failing to talk to the Puppet server, so continuing to enforce the last set of configuration stuff it got. Great, fixing this on the Puppet server won’t necessarily fix all the clients that have been allowed to lose connectivity that no one noticed (entropy can be a bitch.)
Begin looking at what to change on the client (other than just “shut down the Puppet service” and “kill it with fire!”). Realize it’s much faster to surf all the files and directories involved with “mc”.
Midnight Commander not installed. Simple enough, “yum install mc”.
Yum: “What, you want to install something in the base RHEL repo? HAH! Entropy, baby! I have no idea what’s in the base repo.”.
Me: “Hold my beer.” (This is Texas, y’all.)
(No, not really. CTO frowns on drinking during work hours, or drinking while logged into production systems. Or just drinking while logged in…)
OK, so more like:
“Hold my Diet Coke.”
Yum: “Red Hat repos? We don’t need no steeeenking Red Hat repos!”

Start updating Yum repo cache. Run out of space in /var. Discover when this server was built, it was built with much too small a /var. Start looking at what to clean up.
Fix logrotate to compress log files when it rotates them, manually compress old log files.
/var/lib/clamav is one of the larger directories. Oh, look, several failed DB updates that never got cleaned up.
Clean up directory, run freshclam. Gee, clamav DB downloads sure are taking a long time given that it’s got a GigE connection to the local DatabaseMirror. Check Freshclam config. Yup, local mirror is configured… external mirror ALSO configured. Dang it. Fix that. ClamAV DB updates now much faster.
Run yum repo cache update again. Run out of disk space again. Wait… why didn’t Nagios alert that /var was full?
Oh, look, when /var was made a separate partition, no one updated Nagios to monitor it.
Log into Nagios server to update config file for this host. Check changes into Git. Discover there have been a number of other Nagios changes lately that haven’t been checked into Git. Spend half an hour running git status / diff / add / delete / commit / push to get all changes checked into Git repo.
Restart Nagios server (it doesn’t like reloads. Every once in a while it goes bonkers and sends out “The sky is falling! ALL services on ALL servers are down! Run for your lives! The End is nigh!” if you try a simple reload.)
Hmm… if Nagios is out of date for this host, is Cacti…
Update yum cache again. Run out of disk space again.
Good thing this is a VM, with LVM. Add another drive in vSphere, pvcreate, swing your partner, vgextend, lvresize -r, do-si-do!
yum repo cache update… FINALLY!
What was I doing again? Oh, right, install Midnight Commander…
Why? Oh yeah, searching for a Puppet file for….?
Right, root password override.
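
For the record, the “set the password, then sit and stare at /etc/shadow” step doesn’t need a human watching a terminal. Here’s a minimal sketch of the same idea, assuming GNU stat (as on RHEL). The function name wait_for_change and its arguments are made up for this example, not part of any tool.

```shell
#!/bin/sh
# wait_for_change FILE INTERVAL MAX_POLLS  (hypothetical helper)
# Polls FILE's modification time. Prints a timestamped line and returns 0 as
# soon as the mtime differs from the starting value; returns 1 if nothing
# changed after MAX_POLLS polls; returns 2 if the file cannot be stat'ed.
wait_for_change() {
    file=$1
    interval=$2
    max_polls=$3
    start=$(stat -c %Y "$file") || return 2   # GNU stat: mtime in epoch seconds
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        sleep "$interval"
        now=$(stat -c %Y "$file") || return 2
        if [ "$now" != "$start" ]; then
            echo "$(date '+%F %T') $file modified"
            return 0
        fi
        polls=$((polls + 1))
    done
    return 1
}

# Example: check every 30 seconds for up to an hour
#   wait_for_change /etc/shadow 30 120
```

The timestamp it prints can then be lined up against whatever the suspect service wrote to its own log at that moment.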

Every time I log into a server it seems like I find a half dozen things that need fixing. Makes you not want to log into anything, so you can actually get some work done. Oh, right, entropy…

Ansible and Variables

A basic explanation of Ansible and a discussion of variable usage.

I’ve been talking about Ansible on Facebook lately and the other day a friend asked me about Ansible and variables. I gave her a quick explanation, then told her I’d do a more thorough writeup that would be easier to follow than my “stream of consciousness” explanation given in FB messages.
It occurred to me that I’m planning to do a “lunch and learn” on Ansible at work soon, and I could re-use the same material, so I’ll just post this publicly. I plan for this to be the first in a series on DevOps, integration, idempotency, configuration management and Ansible. So without further ado…

For those who have not seen my posts on Facebook, Ansible is a configuration management tool for provisioning, deploying and configuring servers and applications. It is one of a series of such tools that have come out in the last few years, such as Puppet, Chef and Saltstack. It is designed to be fast, easy to use, powerful, efficient and secure. It is serverless and agentless. It aims to be idempotent.

I can’t speak to Puppet, Chef or Saltstack as I’ve never used them.

Addressing these one at a time, not necessarily in the order presented above:

  • Secure
  • Everything is done through SSH tunnels. No passwords, no configuration files, are ever sent over the network in the clear. Set up your SSH keys and you don’t have to worry about typing passwords either.
    There is no agent software running on the managed machines, so there’s nothing to hack.

  • Easy to use
  • “I wrote Ansible because none of the existing tools fit my brain. I wanted a tool that I could not use for 6 months, come back later, and still remember how it worked.”
    Michael DeHaan
    Ansible project founder

  • Efficient
  • No agents, just SSH (or PowerShell with Windows, but I won’t get into that.) The only software required on the managed machine is an SSH daemon and Python.

  • Serverless and Agentless
  • As I’ve already mentioned, there’s no agent running on the managed server. If you can ssh into it and run Python, you’re good to go.
    There is no central server, full of manifests, menus, etc. You can run it from your desktop or laptop. Again, if you have Python, you’re good to go (Ansible connects through your OpenSSH client, or through Paramiko, a Python implementation of SSH). Just make sure you back up your playbooks and roles. Git is a great place for this!

  • idempotency
  • This is one of the most important! It means you should be able to run your Ansible script against a managed host at any time, and not break it. If anything is not configured the way it is supposed to be, the Ansible script will put it back the way it should be. Shell scripts have to be written very carefully to detect if something doesn’t need to be done. It’s also notoriously difficult to modify files with shell scripts (unless you’re really good with tools like sed and awk, or perhaps Perl…)

Some vocabulary before we begin:

  • playbook
  • A file defining which hosts you want to manipulate and what roles you want to apply to those hosts, as well as what tasks you want to run.

  • roles
  • A defined list of tasks to be run when the role is called, as well as any files to be installed, templates to be applied, dependency information, etc.

  • inventory
  • A file listing every server you will manage with Ansible, and what groups they belong to. A host can belong to any number of groups, including none at all, and groups can be members of other groups.

  • host_vars & group_vars
  • Directories with files containing variables specific to certain hosts (host_vars) and host groups (group_vars). These variables are used in your tasks and roles.

Now, on with the discussion of variables. Here was Kathryn’s original question:

How do variables work with dependencies in roles? Meaning, if a role is dependant on another, can it access the variables of the other at run time?

I started to answer with an example we use at work: we have a “common” role that sets up some users with specific UIDs that we want on all our servers, and an “apache” role that depends on that common role (e.g.: it needs the www user created by common). Kathryn further asked:

Okay, say “application” depends on “common” and “common” has default variables… would “application” pick up “common”‘s defaults?

Yes! For example, our “common” role has a task file that pushes out customized /etc/sudoers.d files, depending on what the server will do, what environment it will be in, etc. One of the tasks looks like this:

NOTE: the language used to write Ansible files, Yaml, is whitespace sensitive, however due to the limitations of HTML and my WordPress config, the whitespace is removed from my examples. Do not just cut and paste and expect it to work. You will need to adjust the leading spacing on all lines.

- name: Sudoers - push sudoers.d/hadoop_conf
template: >
when: hadoop_cluster is defined

Note the last line: “when: hadoop_cluster is defined”. “hadoop_cluster” is a variable. This variable isn’t actually defined in our role, but rather in the playbook, or in a host_var or group_var file. In this case we have a group_vars/all_hadoop file. Any task run on any server that is part of the “all_hadoop” group in the inventory will have the variables defined in this group_var file. This file contains:
# file: group_vars/all_hadoop

hadoop_cluster: true

In this case “hadoop_cluster” is defined, and has a value of “true”. Our task above doesn’t care about the value, only that the variable is defined at all. If I run the above task on the server “namenode1”, and “namenode1” is in a group called “all_hadoop” in my inventory file, it will inherit the variables in group_vars/all_hadoop, “hadoop_cluster” is defined, so the task will be run.
Another role or task, which might be part of the “common” role or in a completely different role, will be able to access the same variable and act on it. That role / task might actually care about the value of the variable, and would be able to see that value. Or it might just care that the variable is defined.
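
To make the lookup concrete, here is a minimal sketch that writes the inventory and group_vars file described above into a scratch directory. The function name make_demo_layout, the directory argument and the exact inventory contents are assumptions for illustration. Only the group name, the host name and the variable come from the example.

```shell
#!/bin/sh
# make_demo_layout DIR  (hypothetical helper)
# Writes a tiny inventory and a matching group_vars file under DIR.
make_demo_layout() {
    dir=$1
    mkdir -p "$dir/group_vars" || return 1

    # Inventory: namenode1 is a member of the all_hadoop group.
    printf '%s\n' '[all_hadoop]' 'namenode1' > "$dir/inventory"

    # Variables for every host in all_hadoop. The file name must match
    # the group name used in the inventory.
    printf '%s\n' '# file: group_vars/all_hadoop' 'hadoop_cluster: true' \
        > "$dir/group_vars/all_hadoop"
}

# Example:
#   make_demo_layout /tmp/ansible-demo
# With Ansible installed, the value a host ends up with can be inspected
# with the debug module:
#   ansible -i /tmp/ansible-demo/inventory namenode1 -m debug -a 'var=hadoop_cluster'
```

The group_vars directory sits next to the inventory file, which is one of the places Ansible looks for it.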

Another example: I built a role for a set of servers at work. In our development environment we wanted to allow the developers actually writing the code for the applications to run on those servers to be able to use sudo to gain root access. I added another task to the same file as our Hadoop example above:
- name: Sudoers - push sudoers.d/nova_conf
template: >
when: allow_project_sudo is defined

In our inventory, the development servers for this project are in a “dev_project” group, and there’s a group_vars/dev_project file that defines “allow_project_sudo”. We also have a “production_project” group in our inventory which contains the production servers for this project. The “allow_project_sudo” variable is NOT defined in group_vars/production_project, so that sudoers file is not pushed out.

Directly addressing Kathryn’s question about one role being able to call variables “defined” by another role (although I’ve already addressed the fact that roles don’t really “define” variables, they just access them), I have this task:
- name: Build ssh key files
assemble: >
src={{ item.user }}_ssh_keys
dest=/home/{{ item.user }}/.ssh/authorized_keys
owner={{ item.user }}
group={{ item.group }}
with_items:
- { user: 'projectuser', group: 'projectgroup' }
when: allow_project_sudo is defined

Again, we look to see if “allow_project_sudo” is defined; if so, we build a .ssh/authorized_keys file for the user “projectuser”, allowing all those same devs to ssh into the server as that user. This task also includes the intriguing and useful “with_items”. This allows for a form of looping, such that it will actually perform this task for each item listed in the “with_items” block, redefining the “item.user” and “item.group” variables used in the src, dest, owner and group lines in the task.
We actually define two variables in our “with_items”. Each line in “with_items” is an “item”. In this case each item holds two variables (basically an associative array), and we can reference the key/value pairs in the array. “item.user” has the value “projectuser”. “item.group” has the value “projectgroup”. Thus our “assemble” becomes, on the first iteration of “with_items”:

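assemble: >
src=projectuser_ssh_keys
dest=/home/projectuser/.ssh/authorized_keys
owner=projectuser
group=projectgroup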

This basically says “grab all the files (presumably ssh key files) in the directory ‘projectuser_ssh_keys’ (stored inside a directory in our role) and build, on the managed host, a file called ‘authorized_keys’ in the directory /home/projectuser/.ssh, make that file owned by projectuser:projectgroup, with -rw------- permissions. Oh, and back up the original file first, just in case.”

Manipulating maildirs at the filesystem level

Let’s hear it for being able to manipulate your mail directory structure at the file system level and still be able to access it through Thunderbird.


DJBDNS must run as two separate instances to bind to both an IPv4 and an IPv6 address.

Tip: When patching DJB’s “dnscache” for IPv6, you can’t just tell it to bind to both the IPv4 and IPv6 addresses. You will need to run two separate instances, one binding to the IPv4 address, one binding to the IPv6 address.
I haven’t checked, but I’m betting my tinydns instance is also not binding to both addresses and will have to be run as two separate instances as well.

Fixing Vmware virtual disks

Having hosed a Gentoo guest on a VMware ESXi host by filling the partition (which VMware really doesn’t like) then attempting to fix it by mounting the partition in another guest and fsck’ing it first, I got the error message “the parent virtual disk has been modified since the child was created” when I tried to boot the original Gentoo guest.
Googling pointed me to a nice post at Recovering VMware snapshot after parent changed.
Step two lists the following caveat:

“Look at the size of the snapshot virtual hard disk. If it is more than 2GB and you’re running a 32-bit OS, or it is more than the amount of memory that you have available, the following method will probably not work. You’re welcome to try though.”

I found this wasn’t an issue as it appears (at least as of ESXi 4.x) VMware has separated the vmdk “header” and “data”, putting the “header” in the “hostname.vmdk” file and the actual data in “hostname-flat.vmdk”. The original vmdk is now only a couple of hundred bytes and easily edited in vi. Grabbing the CID from the Gentoo.vmdk and modifying parentCID in Gentoo000001.vmdk had me back up and running (at least to the point that I could now boot the Gentoo guest, using an Ubuntu ISO so I could access the file system and clean it up. I moved /home to a new partition, fixing the space issue).
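
I did the edit by hand in vi, but it can be scripted. A minimal sketch, assuming the descriptor files are plain text with one “CID=…” line and one “parentCID=…” line each, and that the guest is powered off while you edit. The function name fix_parent_cid is made up for this example.

```shell
#!/bin/sh
# fix_parent_cid PARENT_DESCRIPTOR CHILD_DESCRIPTOR  (hypothetical helper)
# Copies the parent's CID into the child's parentCID line.
# The untouched child descriptor is kept as CHILD_DESCRIPTOR.bak.
fix_parent_cid() {
    parent=$1
    child=$2
    # Anchored at line start, so the parent's own parentCID= line is ignored.
    cid=$(sed -n 's/^CID=\([0-9A-Fa-f]*\).*$/\1/p' "$parent" | head -n 1)
    if [ -z "$cid" ]; then
        echo "no CID found in $parent" >&2
        return 1
    fi
    cp -p "$child" "$child.bak" || return 1
    sed "s/^parentCID=.*/parentCID=$cid/" "$child.bak" > "$child"
    echo "parentCID in $child set to $cid"
}

# Example, using the file names from this post:
#   fix_parent_cid Gentoo.vmdk Gentoo000001.vmdk
```

Only run this against the small descriptor files, never the -flat data files.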
Next time, I’ll just be smart and build all systems with LVM, then I can just add more physical extents when I need more space.
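
Roughly what “just add more physical extents” looks like once a new disk has been attached to the guest. This is a sketch only: the device, volume group and logical volume names are made up, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Print each command; only execute it when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# grow_lv NEW_DISK VOLUME_GROUP LV_PATH SIZE  (hypothetical helper)
grow_lv() {
    run pvcreate "$1" &&           # label the new disk as an LVM physical volume
    run vgextend "$2" "$1" &&      # add it to the volume group
    run lvresize -r -L "$4" "$3"   # grow the LV; -r resizes the filesystem too
}

# Example with made-up names: give a home volume another 10G from /dev/sdb
#   grow_lv /dev/sdb vg00 /dev/vg00/home +10G
```

Check the printed commands first, then repeat with DRY_RUN=0 as root.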

VMware ESXi on USB thumb drive

Running Dog League has a good write up on how to install VMware ESXi on a thumb drive.
With this I was able to get it up and running on a Dell PowerEdge 850 that would NOT install ESXi from a CD (couldn’t find a storage device to install to).

CentOS domU under Debian

I finally got a CentOS 5 domU running under Debian.
The xen-tools xen-create-image method didn’t work. I managed to find an appropriate build script for centos5, but it was pretty badly out of date, trying to install RPM versions that don’t exist on the mirror servers any more. Trying to bring it back up to date would have been a PITA. It has the RPM versions hard-coded in the script.
However the instructions at worked a treat.
After following those steps, I converted it from a file-based image to an LVM logical volume, with the following steps:
Manually create logical volumes for the filesystem and swap. I use 40G filesystem LVs and 128M swaps.
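
A minimal sketch of that step. The volume group (ember) and the LV names match the ones used in the rest of this post, and the sizes are the ones mentioned above. The ext3 filesystem and the helper names are assumptions for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Print each command; only execute it when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_domu_volumes VOLUME_GROUP NAME  (hypothetical helper)
# Creates NAME-disk (40G) and NAME-swap (128M) and formats both.
make_domu_volumes() {
    run lvcreate -L 40G -n "$2-disk" "$1" &&
    run lvcreate -L 128M -n "$2-swap" "$1" &&
    run mkfs.ext3 "/dev/$1/$2-disk" &&   # assumption: ext3 for the root filesystem
    run mkswap "/dev/$1/$2-swap"
}

# Example:
#   make_domu_volumes ember centos5
```

The new filesystem LV is what gets mounted on /mnt/centos in the next step.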

# mkdir /mnt/loop
# mkdir /mnt/centos
# mount /home/andrew/centos.5-0.img /mnt/loop -o loop
# mount /dev/mapper/ember-centos5--disk /mnt/centos
# cd /mnt/loop
# cp -Rp bin boot dev etc home lib media mnt opt root sbin selinux srv sys tmp usr var ../centos
# cd
# umount /mnt/loop
# umount /mnt/centos

Then edit /etc/xen/domains/centos.cfg and change the following lines:

kernel = "/boot/vmlinuz-2.6.18-4-xen-686"
ramdisk = "/boot/initrd.img-2.6.18-4-xen-686"
vif = ['bridge=xenbr0']
disk = ['file:/xens/name_of_new_server_to_be/centos.5-0.img,sda1,w','file:/xens/name_of_new_server_to_be/centos.swap,sda2,w']

to:

kernel = '/boot/vmlinuz-2.6.18-6-xen-686'
ramdisk = "/boot/initrd.img-2.6.18-6-xen-686"
vif = [ 'ip=' ]
disk = [ 'phy:ember/centos5-disk,sda1,w', 'phy:ember/centos5-swap,sda2,w' ]

Then “xm create centos”. Boom! CentOS 5, running as a domU on a Debian Etch dom0, from a logical volume.
And I still have the original centos5 image file for creating fresh domUs.

Xen and the art of server maintenance

Ought to be a good title for a book on Xen, no?

Anyway, while discussing Xen with the COO (and it just occurred to me, really this project should be the CTO’s, not the COO’s… odd how the COO does all this stuff…) he came to the conclusion that, like OpenVZ and Virtuozzo, Xen guest systems share the kernel with the host. That didn’t sound right to me, but I couldn’t disprove it with my Xen server, where every DomU had an empty /boot.

So I updated the kernel in Dom0, but didn’t reboot. I now have a newer kernel installed than the one it’s currently running.
I then tweaked the /etc/xen-tools/xen-tools.conf and built a new DomU, to use the new kernel. Everything went without a hitch. I now have a Dom0 running 2.6.18-4-xen-686, with a domU running 2.6.18-6-xen-686. So it would seem that while they all “share” a kernel in the sense that they share a single install on the hard drive (all pulling from the dom0 /boot directory), they aren’t sharing a single instance of the kernel in memory.

I then tried to get a working CentOS 5 domU running, but ran into some snags. That will be another post.

I’ve been a busy little geek

So far this week I’ve:
Finally gotten a working Xen system that will boot a Debian guest.
Successfully installed ispCP on the Debian guest.
Built another Debian guest to be an OpenVPN server.
Successfully built an OpenVPN server and got two clients to connect from outside the network, through the DSL modem/router.
Correctly configured the VPN server to give the client access to the full network via IP masquerading (next trick: get the network to simply route the packets instead of having to use masq).
Got ddclient working on the VPN server to keep dyndns updated so I don’t have to hard code an IP address in my VPN clients and check various server log files to see if it changed.
Fixed ddclient, when it failed to update dyndns with new IP address after my DSL provider mysteriously issued a new one, not 3 hours after setting up ddclient in the first place.

I can now log into my ispcp box from my desk at work, as though it was on the same network. I can now proceed with trying to get Mailman to play nice with ispCP when it’s slow at work.

I get productive when I ignore my games.

Getting hpasm installed on Ubuntu server

While installing Ubuntu Server 8.04 beta on an HP DL-320, I had some trouble getting HP’s “Proliant value added software” (hpasm) package installed. This package contains their system health check and control software which, among other things, switches the fans from “full-time full speed” (which is quite noisy) to temperature controlled speed (e.g.: normal (read: quiet) fan speed when system temp is normal).
The problem with installing and running this software stems from the fact that Ubuntu, for some reason, links /bin/sh to dash instead of bash. Dash is another Bourne shell clone, but doesn’t understand Bash (Bourne-again shell) specific syntax.
Re-linking /bin/sh to bash instead of dash solved the problem and the server is now humming (quietly) along.
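
A quick way to see what /bin/sh really points at before and after the change. A minimal sketch, assuming GNU readlink. The function name shell_behind is made up for this example.

```shell
#!/bin/sh
# shell_behind PATH  (hypothetical helper)
# Follows symlinks and prints the name of the binary PATH ends up at.
shell_behind() {
    basename "$(readlink -f "$1")"
}

# Example:
#   shell_behind /bin/sh
#
# The re-link itself, run as root:
#   ln -sf bash /bin/sh
```

On Debian-family systems, “dpkg-reconfigure dash” asks the same question interactively and is probably the gentler way to make the switch.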