Gosh this sucks

My post to leaplist earlier this morning:

Well, this is fun.

I had a drive reset on my ATA RAID 1+0 this afternoon. The 3Ware card
detected it and marked it as a dead drive. I actually touched the power
connector in question and that’s when it reset. Anyway, I rebooted and told
the 3Ware the drive was ‘ok’. It rebuilt, or so I _thought_, without
incident:

SCSI ID 1 3ware 3W-6410 disk controller
  Array Unit 0 Striped Mirrors 128K (RAID 10) 239.99 GB Rebuilding
    Subunit 0 Mirror (RAID 1) Rebuilding 7%
      Port 0 WDC WD1200JB-32EVA0 120.3 GB Not in Service: Rebuilding
      Port 3 WDC WD1200JB-75EVA0 120.0 GB OK
    Subunit 1 Mirror (RAID 1) OK
      Port 1 WDC WD1200JB-00DUA3 120.3 GB OK
      Port 2 WDC WD1200JB-00DUA3 120.3 GB OK

Now, I’m noticing integrity issues with my data. Files are damaged in strange
ways. For instance, some of my MP3 files now have strange pops in them. That
sort of thing.

The whole reason I was running RAID 1+0, of all things, was to afford myself
some kind of protection against drive failures.

So what happened? Why is my data DoA now?

The partition in question is formatted with XFS, which appears to have been
affected too. Some files are now ‘inaccessible’ even though they show up in
file listings. The last time I experienced this I was running ReiserFS and
it was a sure sign of a corrupt filesystem. I imagine that’s the case here
too. I didn’t notice it until I started running `md5sum` against all my
files, so I’ll have to wait to reboot and run `xfs_repair` to see what it
tells me.

Anyway, it figures, given my last emails, that my backup server has been
acting a bit strange since I ‘moved’ it into its home. Essentially, I moved
the OS drive and the 3Ware 6200 with 2 x 120GB RAID 0, my backup array. For
the moment, 3w-xxxx has stopped giving me random PCI Reset nonsense,
although I did nothing that should have resolved it. I’m running an `md5sum`
against the last known good backup I have of the data on the RAID 1+0
array. Assuming I can actually trust the data on my backup server, I should
be able to identify which relatively static files (MP3s, etc.) have different
signatures but should not. That’ll confirm that this data corruption is not
just my imagination, which I’m fairly certain it is not.
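
For anyone wanting to do the same comparison, here is a rough Python sketch of
the idea, not the exact commands I ran. It hashes every file under two
directory trees and lists the files that exist in both but have different MD5
digests. The function names are made up for illustration, and it assumes both
the live array and the backup are mounted somewhere locally.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                digests[rel] = md5_of_file(full)
            except OSError:
                # An unreadable file is recorded as None so it gets flagged.
                digests[rel] = None
    return digests


def changed_files(live_root, backup_root):
    """List relative paths present in both trees whose digests differ.

    Files that could not be read on the live side are listed as well.
    """
    live = hash_tree(live_root)
    backup = hash_tree(backup_root)
    common = live.keys() & backup.keys()
    return sorted(p for p in common if live[p] is None or live[p] != backup[p])
```

Calling `changed_files("/data", "/mnt/backup")` (both paths are placeholders)
gives the list of suspects. Files that were legitimately edited since the
backup will show up too, so this is only meaningful for static files.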

So, I was just curious if anyone could tell me what the heck is going on?

Something failed me, but I’m not quite sure what it is. Did the 3Ware mess up
rebuilding the RAID 1+0 array? Maybe when the drive reset it actually became
subtly messed up? Maybe XFS busted when the drive that reset suddenly
disappeared from beneath it?

Most importantly, how do you detect this kind of nonsense? It seems that until
you open a document and it’s messed up, or you try to untar a file and tar
tells you the file is busted, you may not even notice anything is wrong.
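
One approach I can think of is to record checksums while the data is known to
be good and re-check them on a schedule. Below is a minimal Python sketch of
that idea, under the assumption that the files are static and that the
manifest is stored somewhere other than the array being checked. The function
names and the JSON manifest format are my own invention for illustration.

```python
import hashlib
import json
import os


def file_digest(path):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Hash every file under root and save the digests as JSON."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    with open(manifest_path, "w") as out:
        json.dump(manifest, out, indent=1, sort_keys=True)
    return manifest


def verify_manifest(root, manifest_path):
    """Return the recorded files that are now missing, unreadable, or changed."""
    with open(manifest_path) as src:
        manifest = json.load(src)
    bad = []
    for rel, expected in sorted(manifest.items()):
        try:
            actual = file_digest(os.path.join(root, rel))
        except OSError:
            actual = None  # missing or unreadable counts as a failure
        if actual != expected:
            bad.append(rel)
    return bad
```

Run `write_manifest` once while the data is trusted, then run
`verify_manifest` periodically; anything it returns either changed on purpose
or got damaged underneath you.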

I’d sleep better at night if someone had a plausible explanation for all/some
of this.

Thanks.