Day in the life of a Systems Administrator

Day in the life of a Unix Systems Administrator

Wow, been almost a year since I blogged anything. I’m getting lazy.

So what’s the daily life of a systems administrator like? Here was today:

Plan coming in in the morning: Begin quarterly “Vulnerability audit report”.

What did I do?
Windows server starts alerting on CPU at midnight, again. We fixed the problem on Tues. Why is it alerting again?
Of course it corrects itself before I can get logged in and doesn’t go off again all day. Send email to person responsible for the application on that server to ask if the app was running any unusually CPU-intensive jobs. Respond with screenshot showing times CPU alerts went off. Get response of “nothing unusual”. As usual.

We updated the root password on all Unix servers last week. Get a list of 44 systems from coworker that still have the old root password.
Check the list, confirm all still have old root password.
Check the list against systems that were updated via Ansible. All on the Ansible list. No failures when running the Ansible playbook to update the root password. All spot-checks that the new root password was in effect at the time showed task was working as expected.
Begin investigating why these systems still have the old root password.
Speculation during team scrum that Puppet might be resetting the root password.
Begin testing hypothesis that root password was, in fact, changed, but something else is re-setting it back to the old password.
Manually update root password on one host. Monitor /etc/shadow to see if it changes again after setting password. (watch -d ls -l /etc/shadow)
Wait some more.
Wait 27 minutes, BOOM! /etc/shadow gets touched.
Investigate to see if Puppet is the culprit. I know nothing about Puppet. I’m an Ansible guy. The Puppet guy (who knows just enough to have set up the server, built some manifests, and gotten Puppet to update root the last time the root password was changed, before I started working here) is out today.
Look at log files in /var/log. Look at files in /etc/puppet on puppet server. Try to find anything that mentions “passw(or)?d&&root” (did I mention I’m not a puppet guy?). Find a manifest that says something about setting the root password, but it references a variable. Can’t find where the value of that variable is set.
Look some more at the target host. See in log files that it’s failing to talk to the Puppet server, so it keeps enforcing the last set of configuration stuff it got. Great, fixing this on the Puppet server won’t necessarily fix all the clients that have lost connectivity without anyone noticing (entropy can be a bitch).
Begin looking at what to change on the client (other than just “shut down the Puppet service” and “kill it with fire!”). Realize it’s much faster to surf all the files and directories involved with “mc”.
Midnight Commander not installed. Simple enough, “yum install mc”.
Yum: “What, you want to install something in the base RHEL repo? HAH! Entropy, baby! I have no idea what’s in the base repo.”
Me: “Hold my beer.” (This is Texas, y’all.)
(No, not really. CTO frowns on drinking during work hours, or drinking while logged into production systems. Or just drinking while logged in…)
OK, so more like:
“Hold my Diet Coke.”
Yum: “Red Hat repos? We don’t need no steeeenking Red Hat repos!”
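
For the record, the “manually set the password, then watch /etc/shadow” test above can also be scripted, so nobody has to stare at a terminal for 27 minutes. This is only a rough sketch of the same idea as `watch -d ls -l /etc/shadow`, not what I actually ran; the function name `wait_for_change` and its parameters are made up for illustration.

```python
import os
import time


def wait_for_change(path, interval=1.0, timeout=None):
    """Poll path until its mtime, inode, or size changes.

    Returns the seconds waited, or None if timeout (in seconds) expires first.
    """
    before = os.stat(path)
    start = time.monotonic()
    while True:
        time.sleep(interval)
        now = os.stat(path)
        # Tools that rewrite /etc/shadow often replace the whole file,
        # so compare the inode as well as the modification time.
        if (now.st_mtime_ns, now.st_ino, now.st_size) != (
            before.st_mtime_ns,
            before.st_ino,
            before.st_size,
        ):
            return time.monotonic() - start
        if timeout is not None and time.monotonic() - start >= timeout:
            return None


# Hypothetical usage (run as root right after changing the password):
#   elapsed = wait_for_change("/etc/shadow", interval=5.0)
#   print(f"/etc/shadow changed after {elapsed / 60:.1f} minutes")
```

It only tells you that something touched the file and roughly when, not who did it. Lining the timestamp up against the Puppet agent’s log is still a manual job.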

Start updating Yum repo cache. Run out of space in /var. Discover when this server was built, it was built with much too small a /var. Start looking at what to clean up.
Fix logrotate to compress log files when it rotates them, manually compress old log files.
/var/lib/clamav is one of the larger directories. Oh, look, several failed DB updates that never got cleaned up.
Clean up directory, run freshclam. Gee, ClamAV DB downloads sure are taking a long time given that it’s got a GigE connection to the local DatabaseMirror. Check freshclam config. Yup, local mirror is configured… external mirror ALSO configured. Dang it. Fix that. ClamAV DB updates now much faster.
Run yum repo cache update again. Run out of disk space again. Wait… why didn’t Nagios alert that /var was full?
Oh, look, when /var was made a separate partition, no one updated Nagios to monitor it.
Log into Nagios server to update config file for this host. Check changes into Git. Discover there have been a number of other Nagios changes lately that haven’t been checked into Git. Spend half an hour running git status / diff / add / delete / commit / push to get all changes checked into Git repo.
Restart Nagios server (it doesn’t like reloads. Every once in a while it goes bonkers and sends out “The sky is falling! ALL services on ALL servers are down! Run for your lives! The End is nigh!” if you try a simple reload).
Hmm… if Nagios is out of date for this host, is Cacti…
Update yum cache again. Run out of disk space again.
Good thing this is a VM, with LVM. Add another drive in vSphere, pvcreate, swing your partner, vgextend, lvresize -r, do-si-do!
yum repo cache update… FINALLY!
What was I doing again? Oh, right, install Midnight Commander…
Why? Oh yeah, searching for a Puppet file for….?
Right, root password override.
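
The unmonitored /var from earlier is the kind of thing a small script could catch: list the directories that are their own mount points and compare them against what monitoring thinks it is watching. A minimal sketch, assuming you can get the monitored paths out of your Nagios config somehow (that part is not shown, and the function names here are invented for the example):

```python
import os
import shutil


def usage_percent(path):
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def unmonitored_mounts(candidates, monitored):
    """Return candidate paths that are separate mount points but are not monitored."""
    return [p for p in candidates if os.path.ismount(p) and p not in monitored]


# Hypothetical usage: the monitored list would come from the monitoring config.
#   for path in unmonitored_mounts(["/", "/var", "/home", "/tmp"], ["/"]):
#       print(f"{path} is {usage_percent(path):.0f}% full and nobody is watching it")
```

The percentage is used/total, so it can differ a little from what df reports on filesystems with reserved blocks.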

Every time I log into a server it seems like I find a half dozen things that need fixing. Makes you not want to log into anything, so you can actually get some work done. Oh, right, entropy…

Adding my network to Cacti

Geeking with Cacti.

So, geeking out this evening, adding my entire home network infrastructure to Cacti, to track how it’s doing.
I’d already set up all my VMs, the Cisco router and Uverse gateway, and my two hosted servers at Rackspace and Linode months ago.
Tonight I added my ESXi server and both Cisco switches. Of course, not much to see on most of the switch ports, since the only port in use on one of them is the uplink to the other switch (which means the only traffic on that port is Cacti polling its SNMP daemon). But it’s interesting, nonetheless.
I’ll probably do the same on the Cisco lab I build for CCNA study.

Fixing VMware virtual disks

Having hosed a Gentoo guest on a VMware ESXi host by filling the partition (which VMware really doesn’t like) then attempting to fix it by mounting the partition in another guest and fsck’ing it first, I got the error message “the parent virtual disk has been modified since the child was created” when I tried to boot the original Gentoo guest.
Googling pointed me to a nice post at Recovering VMware snapshot after parent changed.
Step two lists the following caveat:

“Look at the size of the snapshot virtual hard disk. If it is more than 2GB and you’re running a 32-bit OS, or it is more than the amount of memory that you have available, the following method will probably not work. You’re welcome to try though.”

I found this wasn’t an issue as it appears (at least as of ESXi 4.x) VMware has separated the vmdk “header” and “data”, putting the “header” in the “hostname.vmdk” file and the actual data in “hostname-flat.vmdk”. The original vmdk is now only a couple of hundred bytes and easily edited in vi. Grabbing the CID from the Gentoo.vmdk and modifying parentCID in Gentoo000001.vmdk had me back up and running (at least to the point that I could now boot the Gentoo guest, using an Ubuntu ISO so I could access the file system and clean it up. I moved /home to a new partition, fixing the space issue).
Next time, I’ll just be smart and build all systems with LVM, then I can just add more physical extents when I need more space.
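
I did the CID edit by hand in vi, but since the descriptor files are just small text files, the same fix can be sketched in a few lines of Python. This is an illustration only, assuming the descriptors contain plain `CID=` and `parentCID=` lines with hex values as mine did; the function names are made up, and you would want a backup copy of both .vmdk descriptor files before touching anything.

```python
import re


def read_cid(descriptor_text):
    """Return the CID value from a VMDK descriptor, or None if there is none."""
    # Anchoring at line start keeps this from matching the parentCID= line.
    match = re.search(r"^CID=([0-9a-fA-F]+)", descriptor_text, re.MULTILINE)
    return match.group(1) if match else None


def set_parent_cid(descriptor_text, new_parent_cid):
    """Return the descriptor text with its parentCID value replaced."""
    if not re.fullmatch(r"[0-9a-fA-F]+", new_parent_cid):
        raise ValueError("CID must be a hex string")
    updated, count = re.subn(
        r"^parentCID=[0-9a-fA-F]+",
        "parentCID=" + new_parent_cid,
        descriptor_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one parentCID line")
    return updated


# Hypothetical usage with the file names from this post:
#   parent = open("Gentoo.vmdk").read()
#   child = open("Gentoo000001.vmdk").read()
#   fixed = set_parent_cid(child, read_cid(parent))
#   open("Gentoo000001.vmdk", "w").write(fixed)
```

Only the child’s parentCID changes. The child’s own CID and the -flat data file are left alone.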

VMware ESXi on USB thumb drive

Running Dog Leaugue has a good write-up on how to install VMware ESXi on a thumb drive.
With this I was able to get it up and running on a Dell PowerEdge 850 that would NOT install ESXi from a CD (couldn’t find a storage device to install to).