Monday, September 25, 2006

Stability Can Be Stressful

Last week I got the bright idea to add RAID-1 (mirroring) to the pile of moving parts on my intranet / backup server in the home network.

Adding RAID to the kernel and configuring root-on-raid went quite smoothly. It was probably because of this that I got the bright idea to do LVM-on-raid on a Thursday evening instead of over a long weekend.

Migrating LVM started by creating md1 on hdc4 via mkraid. This was then added to LVM through vgextend. (hdd is not yet in the case and that does complicate things a bit.) Now I needed to move the data from hda4 onto md1. This is where things began to go very badly...

For reference, there appears to be a bug in pvmove 2.02.05. The pvmove command moves data from one physical volume in a volume group onto another physical volume. In my case, I wanted to move from hda4 to md1 so that I could then add hda4 to md1, thus mirroring the data across both hda4 and hdc4.
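For anyone trying to picture the plan, here is a rough sketch of the sequence described above. It only prints the commands and runs nothing. The volume group name is a placeholder, and the pvcreate, vgreduce and raidhotadd steps are assumptions about how the plan would be completed, since only mkraid, vgextend and pvmove are named above. mkraid also expects the array to be defined in /etc/raidtab already.

```shell
#!/bin/sh
# Print (not run) the intended pvmove-based migration.
# Usage: print_pvmove_plan VG OLD_PV MD_DEV
# VG is a hypothetical volume group name supplied by the caller.
print_pvmove_plan() {
    vg="$1"; old_pv="$2"; md="$3"
    cat <<EOF
mkraid $md
pvcreate $md
vgextend $vg $md
pvmove $old_pv $md
vgreduce $vg $old_pv
raidhotadd $md $old_pv
EOF
}
```

Calling `print_pvmove_plan vg0 /dev/hda4 /dev/md1` lists the six steps in order, with the pvmove in the fourth position being the one that went wrong here.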

Keep in mind that the whole point of the operation is to add stability and reliability to the system...

Due to the (alleged) bug I got kernel panics and a solid HDD activity light. There was no chance to reboot cleanly; any attempt ended in a complete system hang. The only solution was a hardware reset.

Now you have to realize that this box is the backup and file server for my home network. It has backups of all of the other hosts and some data that isn't found anywhere else in the network. As it turns out, the most critical (and irreplaceable) logical volumes live(d) physically on hda4.

After the hard reboot I decided that even though I had cross-host backups it might be wise to make another backup of these critical bits. Unfortunately, pvmove was still trying to do its job even after the reboot, and any attempt to read the in-move filesystems resulted in yet another kernel Oops and a solid HDD activity light.

Hours passed, during which I tried any number of things to cancel the pvmove (including pvmove --abort), to no avail. I finally discovered that there is a version 2.02.06 of pvmove and immediately upgraded. To my horror I could no longer even identify the in-move logical volumes. Evil ioctl error messages accompanied any attempt. Fortunately, at this point an --abort did work and I was finally able to see everything.

I decided then and there that pvmove is not my friend and I don't want to associate with it any longer. I did what any sensible person would do... I created parallel filesystems for each one I wanted to move and used my old, reliable friend rsync followed by lvremove of the old volumes and lvrename of the parallels. And you know what? It all worked perfectly.
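The rsync route can be sketched roughly as below. This is an illustration under assumptions, not the exact commands used: the `_new` suffix, the /mnt mount points and the ext3 filesystem are all hypothetical choices, and the script defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Dry-run by default: commands are printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: migrate_lv VG LV SIZE PV
# Creates a parallel volume on PV, copies the data, then swaps the names.
migrate_lv() {
    vg="$1"; lv="$2"; size="$3"; pv="$4"
    new="${lv}_new"                              # hypothetical naming scheme
    run lvcreate -L "$size" -n "$new" "$vg" "$pv" # allocate only on the new PV
    run mkfs.ext3 "/dev/$vg/$new"                # filesystem type is assumed
    run mkdir -p "/mnt/$new"
    run mount "/dev/$vg/$new" "/mnt/$new"
    run rsync -aHx "/mnt/$lv/" "/mnt/$new/"      # assumes old LV is at /mnt/LV
    run umount "/mnt/$new"
    run umount "/mnt/$lv"
    run lvremove "$vg/$lv"                       # destroys the old volume
    run lvrename "$vg" "$new" "$lv"
}
```

For example, `migrate_lv vg0 home 10G /dev/md1` prints the nine steps for a volume called home. Verifying the copy before the lvremove step is the part worth doing by hand.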

So this is a quick post covering about three and a half hours of sheer terror. Now that it's all done (and hdd to be added very soon) I'm quite happy with the results. I feel much more confident that I can lose any one of the four drives and not lose a bit of data. I'm glad to be where I am. I just wish the road had been a bit smoother...

(And, oh yeah, I'm rsync'ing the latest backups to my dev box every night and have no plans to stop...)
